[ 
https://issues.apache.org/jira/browse/SPARK-34544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292987#comment-17292987
 ] 

Rafal Wojdyla edited comment on SPARK-34544 at 3/1/21, 4:40 PM:
----------------------------------------------------------------

👋 [~zero323]
{quote}
it is more a dev utility than a user-facing feature.
{quote}
We use mypy to type check our codebase, and we hit this issue as users; see 
SPARK-34540 for one example (it is just one case). Also, I could not find any 
documentation for pyspark typing contributions (for example, in which cases 
new symbols should be added to public protocols, and why the protocols are 
incomplete). I probably missed it; could you please point me to it?
{quote}
As far as I am aware removing it doesn't resolve any of the problems described 
here
{quote}
Removing {{DataFrameLike}} as the return type of {{toPandas}} would make mypy 
stop complaining about missing symbols (methods that are not part of 
{{DataFrameLike}} but are in fact valid methods of pandas' {{DataFrame}}). 
This is obviously suboptimal, since the return type then just becomes {{Any}}. 
An alternative is to add the missing symbols to {{DataFrameLike}}, as in 
SPARK-34540. But until that lands in a pyspark release, how would we monkey 
patch the change in our projects?
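
To make the problem concrete, here is a minimal sketch of the kind of project-side workaround we are left with today. It deliberately avoids importing pyspark or pandas: {{DataFrame}}, {{DataFrameLike}} and {{to_pandas}} below are simplified stand-ins, and {{to_pandas_cast}} / {{to_pandas_checked}} are hypothetical helper names, not anything that exists in pyspark. In a real project the assumption is that {{pd.DataFrame}} and {{df.toPandas()}} would take the place of the stand-ins.

```python
from typing import Dict, List, Protocol, cast

Row = Dict[str, int]


class DataFrameLike(Protocol):
    """Stand-in for an incomplete protocol: it only declares `head`."""

    def head(self, n: int = 5) -> "DataFrameLike": ...


class DataFrame:
    """Stand-in for the concrete class, which has more methods than the protocol."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def head(self, n: int = 5) -> "DataFrame":
        return DataFrame(self.rows[:n])

    def to_dict(self) -> List[Row]:
        # Valid on the concrete class, but not declared on DataFrameLike.
        return list(self.rows)


def to_pandas(rows: List[Row]) -> DataFrameLike:
    """Stand-in for toPandas(): returns the concrete object, typed as the protocol."""
    return DataFrame(rows)


def to_pandas_cast(rows: List[Row]) -> DataFrame:
    """Hypothetical helper: narrow the static type only, with no runtime check."""
    return cast(DataFrame, to_pandas(rows))


def to_pandas_checked(rows: List[Row]) -> DataFrame:
    """Hypothetical helper: narrow the type and verify it at runtime."""
    result = to_pandas(rows)
    assert isinstance(result, DataFrame)
    return result


# to_pandas(rows).to_dict() is what a type checker rejects, because `to_dict`
# is not part of DataFrameLike. The narrowed helpers type check cleanly.
rows = [{"a": 1}, {"a": 2}]
first = to_pandas_checked(rows).head(1).to_dict()
```

Both helpers amount to the cast or assert that every caller would otherwise have to repeat, which is why neither feels like a real fix.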

So in the end it sounds like we have a bunch of suboptimal options. How should 
we proceed?


> pyspark toPandas() should return pd.DataFrame
> ---------------------------------------------
>
>                 Key: SPARK-34544
>                 URL: https://issues.apache.org/jira/browse/SPARK-34544
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.1
>            Reporter: Rafal Wojdyla
>            Assignee: Maciej Szymkiewicz
>            Priority: Major
>
> Right now {{toPandas()}} returns {{DataFrameLike}}, which is an incomplete 
> "view" of pandas {{DataFrame}}. This leads to cases where mypy reports that 
> certain pandas methods are not present in {{DataFrameLike}}, even though 
> those methods are valid on pandas {{DataFrame}}, which is the actual type 
> of the object. Working around this requires type-ignore comments or asserts.


