[GitHub] [spark] sadhen edited a comment on pull request #31735: [SPARK-34799][PYTHON][SQL] Return User-defined types from Pandas UDF

GitBox Wed, 31 Mar 2021 01:37:15 -0700


sadhen edited a comment on pull request #31735:
URL: https://github.com/apache/spark/pull/31735#issuecomment-810886187

> @HyukjinKwon:
> Furthermore, we will probably have to do it for toPandas and
> createDataFrame with Arrow optimization on. It should be best to think about
> these cases as well.

`toPandas` and `createDataFrame` are supported in the latest commits. See
https://github.com/apache/spark/pull/31735/commits/dca35dfa07449d76c5e940683c846cda984787f6

I just learned about your SPIP vote:
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html

I wonder whether this PR overlaps with the SPIP. I'm new to PySpark; my
previous experience with and contributions to Apache Spark mainly focused on
YARN/SQL. If there is any overlap or conflict, could you give me some feedback?

> @maropu
> I think it is better to make the PR simpler, so how about focusing on
> supporting a simple case (@pandas_udf(UDT)) first ? The improvement to support
> more types (arrays of UDT, ....) can be done in follow-up PRs, I think.

Thanks for your reply and suggestion. Supporting `@pandas_udf(UDT)` first
seems to be a good way to split this PR. I think the most complicated part of
this PR lies in `python/pyspark/sql/pandas/serializers.py`. The current
implementation mixes Spark DataType and Arrow DataType, which hurts code
readability. If CI passes,
https://github.com/apache/spark/pull/31735/commits/dca35dfa07449d76c5e940683c846cda984787f6
will be the last commit for this PR.

I will then try to submit a small, minimal first split PR for you to review,
with a better and cleaner implementation of
`python/pyspark/sql/pandas/serializers.py`.

Here is my plan:
1. [SPARK-34771](https://issues.apache.org/jira/browse/SPARK-34771): Support
UDT for Pandas/Spark conversion with Arrow support Enabled
2. [SPARK-34799](https://issues.apache.org/jira/browse/SPARK-34799): Return
User-defined types from Pandas UDF case 1: `@pandas_udf(UDT)`
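
Both items rest on the same round trip: a UDT value is serialized to its underlying storage form before it is handed to Arrow, and deserialized on the way back. The sketch below only illustrates that idea and is not the code in this PR. `Point`, `PointUDT`, `to_storage` and `from_storage` are hypothetical names, and `PointUDT` is a plain-Python stand-in that copies only the `serialize`/`deserialize` pair of PySpark's `UserDefinedType`, so it runs without Spark or Arrow installed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Point:
    """User-facing object (hypothetical example class)."""
    x: float
    y: float


class PointUDT:
    """Hypothetical stand-in for a PySpark UserDefinedType.

    Only the serialize/deserialize pair is modeled here.
    """

    def serialize(self, obj: Point) -> Tuple[float, float]:
        # Convert the user object to its underlying storage form.
        return (obj.x, obj.y)

    def deserialize(self, datum: Tuple[float, float]) -> Point:
        # Rebuild the user object from the storage form.
        return Point(datum[0], datum[1])


def to_storage(udt: PointUDT, column: List[Optional[Point]]) -> list:
    """Serialize every non-null value, as would happen before Arrow conversion."""
    return [None if v is None else udt.serialize(v) for v in column]


def from_storage(udt: PointUDT, column: list) -> List[Optional[Point]]:
    """Deserialize every non-null value, as would happen after Arrow conversion."""
    return [None if v is None else udt.deserialize(v) for v in column]


udt = PointUDT()
column = [Point(1.0, 2.0), None, Point(3.0, 4.0)]
stored = to_storage(udt, column)
restored = from_storage(udt, stored)
```

In this sketch nulls pass through untouched and `restored` equals the original `column`; how the real serializer maps the storage form to an Arrow type is outside what the sketch covers.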
