[ https://issues.apache.org/jira/browse/SPARK-34588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293645#comment-17293645 ]
Hyukjin Kwon commented on SPARK-34588:
--------------------------------------

Isn't it an Arrow-side issue that is not yet resolved? - https://issues.apache.org/jira/browse/ARROW-4890

> Support int64 buffer lengths in Java for pyspark Pandas UDF as buffer expanding
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-34588
>                 URL: https://issues.apache.org/jira/browse/SPARK-34588
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.0.2
>         Environment: Hadoop part:
>                      * spark 3.0.2
>                      * java 1.8.0_77
>                      * scala 2.12.10
>                      Python part:
>                      * cython 0.29.22
>                      * numpy 1.19.5
>                      * pandas 1.1.5
>                      * pyarrow 2.0.0
>            Reporter: Dmitry Kravchuk
>            Priority: Major
>
> This issue is an extension of the [arrow issue|https://issues.apache.org/jira/browse/ARROW-10957] about making it possible to use pyspark Pandas UDF functions on more than 2 GB of data per group.
>
> Here is the deal - arrow [supports|https://github.com/apache/arrow/commit/9742007c463e253e2b916e65f668146953456a00#diff-2e086b32ec292aae20695dd4341c647c9a9d7d3d77816bf849f7fbf68e9fa6cfR209] the long type for data serialization between Java and Python, but Spark doesn't. This causes a lot of problems when somebody tries to apply a Pandas UDF to a dataset where any group is larger than 2^31 - 1 bytes, which is about 2 GB. Solving this problem would allow up to 2^63 - 1 bytes of data per Pandas UDF group.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
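For context, the 2 GB ceiling described above comes from buffer lengths being signed 32-bit integers on the Java side, so a single serialized Pandas UDF group cannot exceed 2^31 - 1 bytes. A minimal sketch of the arithmetic (pure Python, no Spark required; the 1 KiB row size is a hypothetical figure for illustration):

```python
# Buffer lengths on the Java side are signed 32-bit ints, so one
# serialized group is capped at 2^31 - 1 bytes (~2 GiB).
INT32_MAX = 2**31 - 1   # current per-group ceiling
INT64_MAX = 2**63 - 1   # ceiling if int64 buffer lengths were supported

# Hypothetical example: rows of 1 KiB each in a grouped Pandas UDF.
row_bytes = 1024
max_rows_per_group = INT32_MAX // row_bytes

print(INT32_MAX)            # 2147483647
print(max_rows_per_group)   # 2097151 -> roughly 2 million 1-KiB rows per group
```

Any group whose Arrow-serialized size crosses this boundary fails today, regardless of executor memory, which is why the request is to widen the length field rather than to tune memory settings.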