[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726824#comment-16726824 ]
Li Jin commented on SPARK-26410:
--------------------------------

One thing we want to think about is whether or not to mix batches of different sizes inside a single Eval node. That feels a bit complicated; it is probably a little simpler to group UDFs with the same batch size into one Eval node, at some performance cost. I am not very certain about this and am curious what other people think.

> Support per Pandas UDF configuration
> ------------------------------------
>
>                 Key: SPARK-26410
>                 URL: https://issues.apache.org/jira/browse/SPARK-26410
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Priority: Major
>
> We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the
> "right" batch size usually depends on the task itself. It would be nice if
> users could configure the batch size when they declare the Pandas UDF.
> This is orthogonal to SPARK-23258 (using max buffer size instead of row
> count).
> Besides the API, we should also discuss how to merge Pandas UDFs with different
> configurations. For example,
> {code}
> df.select(predict1(col("features")), predict2(col("features")))
> {code}
> where predict1 requests 100 rows per batch while predict2 requests 120 rows
> per batch.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
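For illustration, a minimal PySpark sketch of what a per-UDF batch-size option might look like. The maxRecordsPerBatch keyword on pandas_udf mentioned in the comments is hypothetical (it is the proposal under discussion, not an existing argument); the only knob that exists today is the session-wide spark.sql.execution.arrow.maxRecordsPerBatch conf.

{code:python}
# Sketch only. A per-UDF batch size is the *proposed* feature; the
# "maxRecordsPerBatch=" decorator argument referenced in comments below
# is hypothetical. Today the only control is the session-wide Arrow conf.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.getOrCreate()

# Current behavior: one global setting applies to every Pandas UDF.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 100)

# Proposed (hypothetical): @pandas_udf("double", maxRecordsPerBatch=100)
@pandas_udf("double")
def predict1(features: pd.Series) -> pd.Series:
    return features * 2.0

# Proposed (hypothetical): @pandas_udf("double", maxRecordsPerBatch=120)
@pandas_udf("double")
def predict2(features: pd.Series) -> pd.Series:
    return features + 1.0

df = spark.range(1000).withColumn("features", col("id").cast("double"))

# With different requested sizes (100 vs 120), the planner would either
# have to feed differently sized batches to a single Eval node, or split
# the UDFs into separate Eval nodes grouped by batch size (the simpler
# option suggested in the comment above, at some performance cost).
df.select(predict1(col("features")), predict2(col("features"))).show()
{code}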