Hi,

after upgrading from 2.3.2 to 2.4.0 we noticed a regression when using posexplode() together with selecting another struct's fields.
Given a structure like this:
=============================

>>> df = (spark.range(1)
...     .withColumn("my_arr", array(lit("1"), lit("2")))
...     .withColumn("bar", lit("1"))
...     .select("id", "my_arr", struct("bar").alias("foo"))
... )
>>>
>>> df.printSchema()
root
 |-- id: long (nullable = false)
 |-- my_arr: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- foo: struct (nullable = false)
 |    |-- bar: string (nullable = false)

Spark 2.3.2
===========

>>> df = df.select(posexplode("my_arr"), "foo.bar")
>>>
>>> df.printSchema()
root
 |-- pos: integer (nullable = false)
 |-- col: string (nullable = false)
 |-- bar: string (nullable = false)

Selecting "foo.bar" results in a field named "bar".

Spark 2.4.0
===========

>>> df = df.select(posexplode("my_arr"), "foo.bar")
>>>
>>> df.printSchema()
root
 |-- pos: integer (nullable = false)
 |-- col: string (nullable = false)
 |-- foo.bar: string (nullable = false)

In 2.4 the column is now named "foo.bar" instead of "bar", which is not what we would expect. As a result, existing code doing .select("bar") fails:
>>> df.select("bar").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/andreas/Downloads/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 1320, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/home/andreas/Downloads/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/andreas/Downloads/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '`bar`' given input columns: [pos, col, foo.bar];;
'Project ['bar]
+- Project [pos#14, col#15, foo#9.bar AS foo.bar#16]
   +- Generate posexplode(my_arr#2), false, [pos#14, col#15]
      +- Project [id#0L, my_arr#2, named_struct(bar, bar#5) AS foo#9]
         +- Project [id#0L, my_arr#2, 1 AS bar#5]
            +- Project [id#0L, array(1, 2) AS my_arr#2]
               +- Range (0, 1, step=1, splits=Some(4))"

Is this a known issue / intended behavior?

Regards
Andreas