[ https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165084#comment-17165084 ]
Selvam Raman commented on SPARK-32294:
--------------------------------------

[~hyukjin.kwon] and [~Tagar], do you have any update on this Jira? As per the PyArrow ticket, the problem may be on the Spark side.

> GroupedData Pandas UDF 2Gb limit
> --------------------------------
>
>                 Key: SPARK-32294
>                 URL: https://issues.apache.org/jira/browse/SPARK-32294
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>
> `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData: the whole group is passed to the Pandas UDF at once, which can hit various 2 GB limitations on the Arrow side (and, in current versions of Arrow, also a 2 GB limitation on the Netty allocator side) - https://issues.apache.org/jira/browse/ARROW-4890
> It would be great to consider feeding GroupedData into the Pandas UDF in batches to solve this issue.
> cc [~hyukjin.kwon]
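
To make the failure mode concrete, here is a minimal sketch of a grouped-map Pandas UDF hitting this code path. The data shape, column names, and the `summarize` function are hypothetical illustrations, not taken from the report; the point is only that each group is materialized as a single pandas DataFrame, so `maxRecordsPerBatch` does not bound what reaches the UDF.

{code:python}
# Minimal sketch of the reported behavior (hypothetical data and names).
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Respected for e.g. toPandas() and scalar Pandas UDFs, but (per this
# report) not when materializing groups for grouped-map Pandas UDFs.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

# Heavily skewed data: every row lands in a single group.
df = spark.range(0, 50000000).withColumn("key", lit(0))

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds the ENTIRE group, not a sequence of 10k-row batches; a
    # group whose Arrow representation exceeds ~2 GB fails as in ARROW-4890.
    return pd.DataFrame({"key": [int(pdf["key"].iloc[0])],
                         "n": [len(pdf)]})

df.groupBy("key").applyInPandas(summarize, schema="key long, n long").show()
{code}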