[ https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160619#comment-17160619 ]
Apache Spark commented on SPARK-32347:
--------------------------------------

User 'TJX2014' has created a pull request for this issue:
https://github.com/apache/spark/pull/29156

> BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Environment: Spark 3.0.0, jupyter notebook, spark launched in local[4] mode, but with Standalone cluster it also fails the same way.
> Reporter: Ihor Bobak
> Priority: Major
> Fix For: 3.0.1
>
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>
> The bug is very easily reproduced: run the following same code in Spark 2.4.3. and in 3.0.0.
> The SQL parser will raise an invalid error message with 3.0.0, although everything seems to be OK with the SQL statement and it works fine in Spark 2.4.3
> {code:python}
> import pandas as pd
> pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
> pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", "BuyerName"])
> df_sales = spark.createDataFrame(pdf_sales)
> df_buyers = spark.createDataFrame(pdf_buyers)
> df_sales.createOrReplaceTempView("df_sales")
> df_buyers.createOrReplaceTempView("df_buyers")
> spark.sql("""
>     with b as (
>         select /*+ BROADCAST(df_buyers) */
>             BuyerID, BuyerName
>         from df_buyers
>     )
>     select
>         b.BuyerID,
>         b.BuyerName,
>         s.Qty
>     from df_sales s
>     inner join b on s.BuyerID = b.BuyerID
> """).toPandas()
> {code}
> The (wrong) error message:
> ---------------------------------------------------------------------------
> AnalysisException                         Traceback (most recent call last)
> <ipython-input-4-8dfe318a59ee> in <module>
>      22 from df_sales s
>      23 inner join b on s.BuyerID = b.BuyerID
> ---> 24 """).toPandas()
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in sql(self, sqlQuery)
>     644         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
>     645         """
> --> 646         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>     647
>     648     @since(2.0)
> /opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
>    1303         answer = self.gateway_client.send_command(command)
>    1304         return_value = get_return_value(
> -> 1305             answer, self.gateway_client, self.target_id, self.name)
>    1306
>    1307         for temp_arg in temp_args:
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, **kw)
>     135                 # Hide where the exception came from that shows a non-Pythonic
>     136                 # JVM exception message.
> --> 137                 raise_from(converted)
>     138             else:
>     139                 raise
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in raise_from(e)
> AnalysisException: cannot resolve '`s.BuyerID`' given input columns: [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
> 'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
> +- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
>    :- SubqueryAlias s
>    :  +- SubqueryAlias df_sales
>    :     +- LogicalRDD [BuyerID#23L, Qty#24L], false
>    +- SubqueryAlias b
>       +- Project [BuyerID#27L, BuyerName#28]
>          +- SubqueryAlias df_buyers
>             +- LogicalRDD [BuyerID#27L, BuyerName#28], false

--
This message was sent by Atlassian Jira
(v8.3.4#803005)