[ 
https://issues.apache.org/jira/browse/SPARK-31186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31186:
---------------------------------
    Fix Version/s: 2.4.6

> toPandas fails on simple query (collect() works)
> ------------------------------------------------
>
>                 Key: SPARK-31186
>                 URL: https://issues.apache.org/jira/browse/SPARK-31186
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.4
>            Reporter: Michael Chirico
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 3.0.0, 2.4.6
>
>
> My pandas version is 0.25.1.
> I ran the following simple code (cross joins are enabled):
> {code:python}
> spark.sql('''
> select t1.*, t2.* from (
>   select explode(sequence(1, 3)) v
> ) t1 left join (
>   select explode(sequence(1, 3)) v
> ) t2
> ''').toPandas()
> {code}
> and got a ValueError from pandas:
> > ValueError: The truth value of a Series is ambiguous. Use a.empty, 
> > a.bool(), a.item(), a.any() or a.all().
> collect() works fine:
> {code:python}
> spark.sql('''
> select * from (
>   select explode(sequence(1, 3)) v
> ) t1 left join (
>   select explode(sequence(1, 3)) v
> ) t2
> ''').collect()
> # [Row(v=1, v=1),
> #  Row(v=1, v=2),
> #  Row(v=1, v=3),
> #  Row(v=2, v=1),
> #  Row(v=2, v=2),
> #  Row(v=2, v=3),
> #  Row(v=3, v=1),
> #  Row(v=3, v=2),
> #  Row(v=3, v=3)]
> {code}
> I imagine it's related to the duplicate column names, but this doesn't fail:
> {code:python}
> spark.sql("select 1 v, 1 v").toPandas()
> #    v  v
> # 0  1  1
> {code}
> Also no issue for multiple rows:
> {code:python}
> spark.sql("select 1 v, 1 v union all select 1 v, 2 v").toPandas()
> {code}
> It also works when not using a cross join but a janky 
> programmatically-generated union all query:
> {code:python}
> cond = []
> for ii in range(3):
>     for jj in range(3):
>         cond.append(f'select {ii+1} v, {jj+1} v')
> spark.sql(' union all '.join(cond)).toPandas()
> {code}
> As near as I can tell, the output is identical to the explode output, which 
> makes this issue all the more peculiar: I thought toPandas() was applied to 
> the output of collect(), so if collect() gives the same output in both cases, 
> how can toPandas() fail in one and not the other? Further, the lazy DataFrame 
> is the same in both cases: DataFrame[v: int, v: int]. I must be missing 
> something.
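
For what it's worth, here is a pandas-only sketch of how duplicate column labels can produce this exact ValueError. This assumes the failure arises in toPandas' per-column post-processing of the converted frame (the actual Spark 2.4 code path is not shown, and the aliasing workaround below is only a hypothetical illustration):

```python
import pandas as pd

# With duplicate column labels, selecting a single label yields a
# DataFrame (both matching columns), not a Series:
df = pd.DataFrame([[1, 2]], columns=['v', 'v'])
print(type(df['v']))  # a DataFrame, not a Series

# Any per-column code that then truth-tests such a multi-element
# object raises the reported error:
try:
    bool(pd.Series([1, 2]))
except ValueError as e:
    print(e)  # "The truth value of a Series is ambiguous. ..."

# Hypothetical workaround sketch: deduplicate the labels, e.g. by
# aliasing in SQL (select t1.v as v1, t2.v as v2) or renaming here:
df.columns = [f'{name}_{i}' for i, name in enumerate(df.columns)]
print(df.columns.tolist())  # ['v_0', 'v_1']
```

This would also be consistent with the single-row and union-all cases succeeding only incidentally, depending on which per-column checks each conversion path happens to hit.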



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
