Hi,

after upgrading from 2.3.2 to 2.4.0 we noticed a regression when using
posexplode() together with selecting a field of another struct column.

Given a structure like this:
=============================
>>> from pyspark.sql.functions import array, lit, posexplode, struct
>>> df = (spark.range(1)
...     .withColumn("my_arr", array(lit("1"), lit("2")))
...     .withColumn("bar", lit("1"))
...     .select("id", "my_arr", struct("bar").alias("foo"))
... )
>>>
>>> df.printSchema()
root
 |-- id: long (nullable = false)
 |-- my_arr: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- foo: struct (nullable = false)
 |    |-- bar: string (nullable = false)



Spark 2.3.2
===========
>>>
>>> df = df.select(posexplode("my_arr"), "foo.bar")
>>>
>>> df.printSchema()
root
 |-- pos: integer (nullable = false)
 |-- col: string (nullable = false)
 |-- bar: string (nullable = false)


Selecting "foo.bar" results in a column named "bar".


Spark 2.4.0
===========
>>>
>>> df = df.select(posexplode("my_arr"), "foo.bar")
>>>
>>> df.printSchema()
root
 |-- pos: integer (nullable = false)
 |-- col: string (nullable = false)
 |-- foo.bar: string (nullable = false)


In 2.4.0 the same column is now named 'foo.bar' instead of 'bar', which is
not what we would expect.

So existing code that calls .select("bar") on the result now fails:

>>> df.select("bar").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/andreas/Downloads/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 1320, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/home/andreas/Downloads/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/andreas/Downloads/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '`bar`' given input
columns: [pos, col, foo.bar];;\n'Project ['bar]\n+- Project [pos#14,
col#15, foo#9.bar AS foo.bar#16]\n   +- Generate posexplode(my_arr#2),
false, [pos#14, col#15]\n      +- Project [id#0L, my_arr#2,
named_struct(bar, bar#5) AS foo#9]\n         +- Project [id#0L, my_arr#2, 1
AS bar#5]\n            +- Project [id#0L, array(1, 2) AS my_arr#2]\n
       +- Range (0, 1, step=1, splits=Some(4))\n"


Is this a known issue / intended behavior?

Regards
Andreas
