[ https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165084#comment-17165084 ]
Selvam Raman commented on SPARK-32294:
--------------------------------------

[~hyukjin.kwon] and [~Tagar], do you have any update on this Jira? As per the PyArrow ticket, the problem may be on the Spark side.

> GroupedData Pandas UDF 2Gb limit
> --------------------------------
>
>                 Key: SPARK-32294
>                 URL: https://issues.apache.org/jira/browse/SPARK-32294
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>
> `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData: the whole group is passed to the Pandas UDF at once, which can hit various 2 GB limitations on the Arrow side (and, in current versions of Arrow, also a 2 GB limitation on the Netty allocator side) - https://issues.apache.org/jira/browse/ARROW-4890
> It would be great to consider feeding GroupedData into the Pandas UDF in batches to solve this issue.
> cc [~hyukjin.kwon]
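
To make the failure mode concrete, here is a minimal sketch of a grouped-map Pandas UDF hitting this code path. The data shape, column names, and the `summarize` function are hypothetical illustrations, not taken from the report; the point is only that each group is materialized as a single pandas DataFrame, so `maxRecordsPerBatch` does not bound what reaches the UDF.

{code:python}
# Minimal sketch of the reported behavior (hypothetical data and names).
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Respected for e.g. toPandas() and scalar Pandas UDFs, but (per this
# report) not when materializing groups for grouped-map Pandas UDFs.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

# Heavily skewed data: every row lands in a single group.
df = spark.range(0, 50000000).withColumn("key", lit(0))

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds the ENTIRE group, not a sequence of 10k-row batches; a
    # group whose Arrow representation exceeds ~2 GB fails as in ARROW-4890.
    return pd.DataFrame({"key": [int(pdf["key"].iloc[0])],
                         "n": [len(pdf)]})

df.groupBy("key").applyInPandas(summarize, schema="key long, n long").show()
{code}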