sadhen edited a comment on pull request #31735:
URL: https://github.com/apache/spark/pull/31735#issuecomment-810886187


   > @HyukjinKwon:
   > Furthermore, we will probably have to do it for toPandas and 
createDataFrame with Arrow optimization on. It should be best to think about 
these cases as well.
   
   `toPandas` and `createDataFrame` are supported in the latest commits. See 
https://github.com/apache/spark/pull/31735/commits/dca35dfa07449d76c5e940683c846cda984787f6
   
   I just learned about your SPIP vote: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html
   
   I wonder whether this PR overlaps with the SPIP. I'm new to 
PySpark; my previous contributions to Apache Spark mainly focused on 
YARN/SQL. If there is any overlap or conflict, could you give me some feedback?
   
   > @maropu 
   > I think it is better to make the PR simpler, so how about focusing on 
supporting a simple case (@pandas_udf(UDT)) first ? The improvement to support 
more types (arrays of UDT, ....) can be done in follow-up PRs, I think.
   
   Thanks for your reply and suggestion. Supporting `@pandas_udf(UDT)` first 
seems to be a good way to split this PR. For this PR, I think the most 
complicated part lies in `python/pyspark/sql/pandas/serializers.py`. The 
current implementation mixes Spark DataType and Arrow DataType, which hurts 
code readability. For this PR, if CI passes, 
https://github.com/apache/spark/pull/31735/commits/dca35dfa07449d76c5e940683c846cda984787f6
 will be the last commit.
   
   I will then try to submit a small, minimal first split PR for you 
to review, with a better and cleaner implementation of 
`python/pyspark/sql/pandas/serializers.py`.
   
   Here is my plan:
   1. [SPARK-34771](https://issues.apache.org/jira/browse/SPARK-34771): Support 
UDT for Pandas/Spark conversion with Arrow support Enabled
   2. [SPARK-34799](https://issues.apache.org/jira/browse/SPARK-34799): Return 
User-defined types from Pandas UDF case 1: `@pandas_udf(UDT)`
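
Both steps of the plan need the same boundary conversion: user objects are turned into their storage form before they reach a columnar format, and turned back afterwards. The following is a minimal pure-Python sketch of that round trip, not the code in this PR. `Point`, `PointUDT`, `to_storage` and `from_storage` are hypothetical names; `PointUDT` only mirrors the `serialize`/`deserialize` pair of PySpark's `UserDefinedType` and does not import PySpark or Arrow.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Point:
    """Hypothetical user class that a UDT would wrap."""
    x: float
    y: float


class PointUDT:
    """Hypothetical stand-in that mirrors the serialize/deserialize pair
    of PySpark's UserDefinedType; it does not extend the real class."""

    def serialize(self, obj: Point) -> List[float]:
        # Storage form: a plain list of doubles that a columnar format can hold.
        return [obj.x, obj.y]

    def deserialize(self, datum: Sequence[float]) -> Point:
        # Rebuild the user object from its storage form.
        return Point(datum[0], datum[1])


def to_storage(udt: PointUDT, column: Sequence[Optional[Point]]) -> list:
    """Convert a column of user objects to storage form, keeping nulls."""
    return [None if v is None else udt.serialize(v) for v in column]


def from_storage(udt: PointUDT, column: Sequence[Optional[Sequence[float]]]) -> list:
    """Convert a column of storage values back to user objects, keeping nulls."""
    return [None if v is None else udt.deserialize(v) for v in column]


udt = PointUDT()
points = [Point(1.0, 2.0), None, Point(3.0, 4.0)]
stored = to_storage(udt, points)        # what a columnar layer would carry
restored = from_storage(udt, stored)    # what the user would get back
```

Keeping the conversion in two small functions like these, separate from any Arrow type handling, is one way to avoid mixing the two type systems in a single code path.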
   




