sadhen edited a comment on pull request #31735: URL: https://github.com/apache/spark/pull/31735#issuecomment-810886187
> @HyukjinKwon:
> Furthermore, we will probably have to do it for toPandas and createDataFrame with Arrow optimization on. It should be best to think about these cases as well.

`toPandas` and `createDataFrame` are supported in the latest commits. See https://github.com/apache/spark/pull/31735/commits/dca35dfa07449d76c5e940683c846cda984787f6

I just learned about the SPIP vote you started: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html

I wonder whether this PR overlaps with the SPIP. I'm new to PySpark; my previous experience with and contributions to Apache Spark mainly focused on Yarn/SQL. If there is any overlap or conflict, could you give me some feedback?

> @maropu
> I think it is better to make the PR simpler, so how about focusing on supporting a simple case (@pandas_udf(UDT)) first ? The improvement to support more types (arrays of UDT, ....) can be done in follow-up PRs, I think.

Thanks for your reply and suggestion. Supporting `@pandas_udf(UDT)` first seems to be a good way to split this PR.

For this PR, I think the most complicated part lies in `python/pyspark/sql/pandas/serializers.py`. The current implementation mixes Spark DataType and Arrow DataType, which hurts code readability.

If CI passes, https://github.com/apache/spark/pull/31735/commits/dca35dfa07449d76c5e940683c846cda984787f6 will be the last commit for this PR. I will then try to submit a small, minimal first split PR for you to review, with a better and cleaner implementation of `python/pyspark/sql/pandas/serializers.py`.

Here is my plan:
1. [SPARK-34771](https://issues.apache.org/jira/browse/SPARK-34771): Support UDT for Pandas/Spark conversion with Arrow support Enabled
2. [SPARK-34799](https://issues.apache.org/jira/browse/SPARK-34799): Return User-defined types from Pandas UDF
   case 1: `@pandas_udf(UDT)`