[ https://issues.apache.org/jira/browse/SPARK-31186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-31186:
---------------------------------
    Fix Version/s: 2.4.6

> toPandas fails on simple query (collect() works)
> ------------------------------------------------
>
>                 Key: SPARK-31186
>                 URL: https://issues.apache.org/jira/browse/SPARK-31186
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.4
>            Reporter: Michael Chirico
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 3.0.0, 2.4.6
>
> My pandas version is 0.25.1.
> I ran the following simple code (cross joins are enabled):
> {code:python}
> spark.sql('''
> select t1.*, t2.* from (
>   select explode(sequence(1, 3)) v
> ) t1 left join (
>   select explode(sequence(1, 3)) v
> ) t2
> ''').toPandas()
> {code}
> and got a ValueError from pandas:
> {noformat}
> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
> {noformat}
> collect() works fine:
> {code:python}
> spark.sql('''
> select * from (
>   select explode(sequence(1, 3)) v
> ) t1 left join (
>   select explode(sequence(1, 3)) v
> ) t2
> ''').collect()
> # [Row(v=1, v=1),
> #  Row(v=1, v=2),
> #  Row(v=1, v=3),
> #  Row(v=2, v=1),
> #  Row(v=2, v=2),
> #  Row(v=2, v=3),
> #  Row(v=3, v=1),
> #  Row(v=3, v=2),
> #  Row(v=3, v=3)]
> {code}
> I imagine it's related to the duplicate column names, but this doesn't fail:
> {code:python}
> spark.sql("select 1 v, 1 v").toPandas()
> #    v  v
> # 0  1  1
> {code}
> There's also no issue for multiple rows:
> {code:python}
> spark.sql("select 1 v, 1 v union all select 1 v, 2 v").toPandas()
> {code}
> It also works when not using a cross join but a janky programmatically-generated union-all query:
> {code:python}
> cond = []
> for ii in range(3):
>     for jj in range(3):
>         cond.append(f'select {ii+1} v, {jj+1} v')
> spark.sql(' union all '.join(cond)).toPandas()
> {code}
> As near as I can tell, the output is identical to the explode output, which makes this issue all the more peculiar. I thought toPandas() was applied to the output of collect(), so if collect() gives the same output, how can toPandas() fail in one case and not the other? Further, the lazy DataFrame is the same, DataFrame[v: int, v: int], in both cases. I must be missing something.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
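For reference, the duplicate-label behaviour that plausibly underlies the reported ValueError can be shown with pandas alone. This is a minimal sketch, not Spark's actual conversion code: the DataFrame below is hand-built to mimic the self-join result, and the `if` branch stands in for any internal per-column check that assumes a unique label.

```python
import pandas as pd

# Two columns sharing the label "v", as in the self-join result.
pdf = pd.DataFrame([[1, 1]], columns=["v", "v"])

# With a unique label, pdf["v"] is a Series; with the duplicate label it is
# a two-column DataFrame, so downstream per-column logic gets the wrong type.
assert isinstance(pdf["v"], pd.DataFrame)

# Any code that then branches on an elementwise comparison hits the
# exact error from the report, because `if` needs a single boolean:
try:
    if pdf.dtypes == "int64":  # dtypes is a length-2 Series here
        pass
except ValueError as e:
    print(e)  # "The truth value of a Series is ambiguous. ..."
```

A practical workaround for the original query is to give the columns unique names before converting, e.g. `df.toDF("v1", "v2").toPandas()` (using the standard PySpark `DataFrame.toDF` method), so every label selects a Series again.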