[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752467#comment-16752467 ]
Xiangrui Meng commented on SPARK-26410:
---------------------------------------

There are several possible solutions to this. SPARK-23258 is one. I think it is more reasonable to limit the buffer size instead of the number of records per batch, because the right record count varies per task.

> Support per Pandas UDF configuration
> ------------------------------------
>
>                 Key: SPARK-26410
>                 URL: https://issues.apache.org/jira/browse/SPARK-26410
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Priority: Major
>
> We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the
> "right" batch size usually depends on the task itself. It would be nice if
> users could configure the batch size when they declare the Pandas UDF.
> This is orthogonal to SPARK-23258 (using max buffer size instead of row
> count).
> Besides the API, we should also discuss how to merge Pandas UDFs of different
> configurations. For example,
> {code}
> df.select(predict1(col("features")), predict2(col("features")))
> {code}
> when predict1 requests 100 rows per batch, while predict2 requests 120 rows
> per batch.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
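
To illustrate the point about limiting buffer size rather than record count, here is a minimal sketch in plain Python. It is not Spark's implementation: `batch_by_count`, `batch_by_bytes`, and the sample workload are hypothetical names and data made up for illustration, and rows are modeled as raw byte strings so that `len(row)` stands in for the serialized size.

```python
from typing import Iterable, Iterator, List


def batch_by_count(rows: Iterable[bytes], max_records: int) -> Iterator[List[bytes]]:
    """Group rows into batches of at most max_records rows (row-count limit)."""
    batch: List[bytes] = []
    for row in rows:
        batch.append(row)
        if len(batch) == max_records:
            yield batch
            batch = []
    if batch:
        yield batch


def batch_by_bytes(rows: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Group rows so each batch carries at most max_bytes of payload.

    A single row larger than max_bytes still gets a batch of its own.
    """
    batch: List[bytes] = []
    size = 0
    for row in rows:
        # Close the current batch before it would exceed the byte budget.
        if batch and size + len(row) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(row)
        size += len(row)
    if batch:
        yield batch


# Hypothetical workload: 6 small rows (10 bytes) followed by 6 large rows (1000 bytes).
rows = [b"x" * 10] * 6 + [b"x" * 1000] * 6

count_batches = list(batch_by_count(rows, max_records=4))
byte_batches = list(batch_by_bytes(rows, max_bytes=2000))

# Payload per batch, in bytes.
count_sizes = [sum(len(r) for r in b) for b in count_batches]  # [40, 2020, 4000]
byte_sizes = [sum(len(r) for r in b) for b in byte_batches]    # [1060, 2000, 2000, 1000]

print("row-count limit :", count_sizes)
print("byte-size limit :", byte_sizes)
```

With a fixed row count, the payload per batch swings from 40 to 4000 bytes depending on the rows, whereas the byte budget keeps every batch at or under 2000 bytes and lets the row count vary instead (7, 2, 2, 1 in this sketch).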