maximethebault opened a new issue, #6224:
URL: https://github.com/apache/iceberg/issues/6224
### Apache Iceberg version
1.0.0 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
After upgrading to Iceberg 1.0.0 & Spark 3.3.1 (from 0.13.x & 3.2.x), some
of our SQL queries stopped working.
We suspect it may be an Iceberg-related issue, as we couldn't reproduce it
with Hive tables.
### Stripped-down reproducer
Set-up tables & views
```
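import spark.implicits._ // provides toDF; already in scope in spark-shell

// Plain (non-Iceberg) temp view
val table1 = Seq(("204")).toDF("id")
table1.createOrReplaceTempView("table1")
// Iceberg table
val table2_1 = Seq(("204")).toDF("id")
table2_1.writeTo("dev.table2_1").using("iceberg").createOrReplace()
// Plain (non-Iceberg) temp view
val table2_2 = Seq(("204")).toDF("id")
table2_2.createOrReplaceTempView("table2_2")
// Union of the Iceberg table and the temp view, exposed as a temp view
val table2 = spark.table("dev.table2_1").union(spark.table("table2_2"))
table2.createOrReplaceTempView("table2")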
```
Run query
```
SELECT
u.*
FROM
table1
LEFT JOIN
(
SELECT
id
FROM
table1
LEFT JOIN
table2
USING(id)
) u
USING(id)
```
Results in an exception:
```
java.lang.IllegalArgumentException: requirement failed
    at scala.Predef$.require(Predef.scala:268)
    at org.apache.spark.sql.catalyst.plans.logical.View.<init>(basicLogicalOperators.scala:569)
    at org.apache.spark.sql.catalyst.plans.logical.View.copy(basicLogicalOperators.scala:568)
    at org.apache.spark.sql.catalyst.plans.logical.View.withNewChildInternal(basicLogicalOperators.scala:604)
    at org.apache.spark.sql.catalyst.plans.logical.View.withNewChildInternal(basicLogicalOperators.scala:565)
    at org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1242)
    at org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1240)
    at org.apache.spark.sql.catalyst.plans.logical.View.withNewChildrenInternal(basicLogicalOperators.scala:565)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:462)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
    at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:461)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.org$apache$spark$sql$catalyst$analysis$Analyzer$AddMetadataColumns$$addMetadataCol(Analyzer.scala:975)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$addMetadataCol$1(Analyzer.scala:975)
```
### Further investigation
If I replace "USING" with classical "ON" clauses, the exception is not
thrown.
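For reference, the ON-based variant might look like the sketch below. The exact rewritten query is not part of this report, so the qualified column references (`table1.id`, `table2.id`, `u.id`) and the `result` name are assumptions; it expects a `spark-shell` session in which the views from the set-up step already exist.

```scala
// Sketch of the ON-based rewrite (assumed form, not the exact query that was run).
// The inner SELECT qualifies id as table1.id because, unlike USING, an ON join
// keeps both id columns in scope and an unqualified id would be ambiguous.
val result = spark.sql("""
  SELECT
    u.*
  FROM
    table1
  LEFT JOIN
    (
      SELECT
        table1.id
      FROM
        table1
      LEFT JOIN
        table2
      ON table1.id = table2.id
    ) u
  ON table1.id = u.id
""")
result.show()
```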
I think this issue is caused by the fact I'm mixing Iceberg & non-Iceberg
tables in the UNION clause.
If I inline table2 in the query, I get a different exception:
```
SELECT
u.*
FROM
table1
LEFT JOIN
(
SELECT
id
FROM
table1
LEFT JOIN
((SELECT id id FROM dev.table2_1 limit 1) UNION (SELECT id FROM table2_2))
USING(id)
) u
USING(id)
```
results in:
```
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 1 columns;
'Project [id#1302]
+- 'Project [id#1302, id#1302]
+- 'Project [id#1302, id#998]
+- 'Join LeftOuter, (id#998 = id#1302)
:- SubqueryAlias table1
: +- View (`table1`, [id#998])
: +- Project [value#995 AS id#998]
: +- LocalRelation [value#995]
+- 'SubqueryAlias u
+- 'Project [id#1294, id#1302]
+- 'Project [id#1294, id#1302]
+- 'Join LeftOuter, (id#1302 = id#1294)
:- SubqueryAlias table1
: +- View (`table1`, [id#1302])
: +- Project [value#1296 AS id#1302]
: +- LocalRelation [value#1296]
+- 'SubqueryAlias __auto_generated_subquery_name
+- 'Distinct
+- 'Union false, false
:- GlobalLimit 1
: +- LocalLimit 1
: +- Project [_spec_id#1297, _partition#1298, _file#1299, _pos#1300L, _deleted#1301, id#1295 AS id#1294]
: +- SubqueryAlias spark_catalog.dev.table2_1
: +- RelationV2[id#1295, _spec_id#1297, _partition#1298, _file#1299, _pos#1300L, _deleted#1301] spark_catalog.dev.table2_1
+- Project [id#1011]
+- SubqueryAlias table2_2
+- View (`table2_2`, [id#1011])
+- Project [value#1008 AS id#1011]
+- LocalRelation [value#1008]
```
It looks like some Iceberg metadata columns (`_spec_id`, `_partition`, `_file`,
`_pos`, `_deleted`) are visible to Spark during query analysis, and I'm not
sure they are supposed to be.
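One way to look at this, sketched under the assumption that the `dev.table2_1` table from the set-up exists in the current `spark-shell` session, is to compare the table's regular columns with the metadata attributes on its analyzed plan. `metadataOutput` is a Catalyst-internal method on `LogicalPlan`, so it may change between Spark versions; `df`, `visibleCols` and `metadataCols` are illustrative names.

```scala
// Compare the user-visible schema with the metadata attributes that the
// analyzer can pull in (Catalyst internals; not a stable public API).
val df = spark.table("dev.table2_1")

// Columns a plain SELECT * would return.
val visibleCols = df.columns.toSeq

// Metadata attributes exposed by the analyzed logical plan.
val metadataCols = df.queryExecution.analyzed.metadataOutput.map(_.name)

println(s"visible:  ${visibleCols.mkString(", ")}")
println(s"metadata: ${metadataCols.mkString(", ")}")
```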