Dmitry Kravchuk created SPARK-34588:
---------------------------------------
Summary: Support int64 buffer lengths in Java for pyspark Pandas UDF as buffer expanding
Key: SPARK-34588
URL: https://issues.apache.org/jira/browse/SPARK-34588
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 3.0.2
Environment:
Hadoop part:
* spark 3.0.2
* java 1.8.0_77
* scala 2.12.10
Python part:
* cython 0.29.22
* numpy 1.19.5
* pandas 1.1.5
* pyarrow 2.0.0
Reporter: Dmitry Kravchuk
Fix For: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.0.3

This issue is an extension of an [Arrow issue|https://issues.apache.org/jira/browse/ARROW-10957#] and aims to make it possible to use PySpark Pandas UDF functions on more than 2 GB of data per group.

Here is the deal: Arrow [supports|https://github.com/apache/arrow/commit/9742007c463e253e2b916e65f668146953456a00#diff-2e086b32ec292aae20695dd4341c647c9a9d7d3d77816bf849f7fbf68e9fa6cfR209] the long (int64) type for buffer lengths during data serialization between Java and Python, but Spark does not. This causes a lot of problems whenever a Pandas UDF is applied to a dataset in which any single group exceeds 2^31-1 bytes (~2 GB), the signed int32 maximum. Solving this would raise the per-group limit for Pandas UDF grouping to 2^63-1 bytes.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
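For context, a minimal stdlib-only sketch of the arithmetic behind the limit described above: buffer lengths carried as a signed 32-bit int cap a single serialized Pandas UDF group at 2^31-1 bytes, while widening them to a signed 64-bit long raises the cap to 2^63-1 bytes. The `groups_over_limit` helper and the byte-size figures are hypothetical, purely for illustration (not Spark or Arrow API):

```python
# Signed 32-bit limit that currently caps the serialized size of one
# Pandas UDF group in Spark's Java<->Python Arrow transfer (~2 GiB).
INT32_MAX = 2**31 - 1  # 2_147_483_647 bytes
# Limit after widening buffer lengths to a signed 64-bit long.
INT64_MAX = 2**63 - 1

def groups_over_limit(group_sizes, limit=INT32_MAX):
    """Return keys of groups whose estimated payload exceeds `limit`.

    `group_sizes` maps a group key to its serialized size in bytes
    (hypothetical estimate, e.g. from per-group pandas memory usage).
    """
    return [key for key, size in group_sizes.items() if size > limit]

# One 3 GiB group trips the current int32 limit but would fit
# comfortably under int64 buffer lengths.
sizes = {"a": 512 * 2**20, "b": 3 * 2**30}
print(groups_over_limit(sizes))             # -> ['b']
print(groups_over_limit(sizes, INT64_MAX))  # -> []
```

This mirrors the failure mode reported here: only the size of each individual group matters, so a job can fail even when the overall dataset is far larger than 2 GB but every group stays under the int32 cap.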