[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration
[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752467#comment-16752467 ]

Xiangrui Meng commented on SPARK-26410:

There are several possible solutions to this. SPARK-23258 is one. I think it is more reasonable to limit the buffer size rather than the number of records per batch, because the right record count varies from task to task.

> Support per Pandas UDF configuration
>
> Key: SPARK-26410
> URL: https://issues.apache.org/jira/browse/SPARK-26410
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the
> "right" batch size usually depends on the task itself. It would be nice if
> users could configure the batch size when they declare the Pandas UDF.
>
> This is orthogonal to SPARK-23258 (using max buffer size instead of row count).
>
> Besides the API, we should also discuss how to merge Pandas UDFs with different
> configurations. For example,
> {code}
> df.select(predict1(col("features")), predict2(col("features")))
> {code}
> where predict1 requests 100 rows per batch, while predict2 requests 120 rows per batch.
>
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
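As a rough sketch of the API the ticket asks for: the per-UDF maxRecordsPerBatch keyword shown below is hypothetical and does not exist in PySpark today (only the session-wide Arrow conf does), and the UDF bodies are placeholders.

{code}
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("double")  # proposed: pandas_udf("double", maxRecordsPerBatch=100)
def predict1(features: pd.Series) -> pd.Series:
    return features * 2.0  # placeholder for a real model

@pandas_udf("double")  # proposed: pandas_udf("double", maxRecordsPerBatch=120)
def predict2(features: pd.Series) -> pd.Series:
    return features + 1.0  # placeholder for a real model

# The merging question from the description: both UDFs share one projection,
# so the executor would have to reconcile the two requested batch sizes.
# df.select(predict1(col("features")), predict2(col("features")))
{code}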
[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration
[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751771#comment-16751771 ]

Bryan Cutler commented on SPARK-26410:

This could be useful to have, but it does seem a little strange to bind the batch size to a UDF. To me, batch size seems more related to the data being used, and merging different batch sizes could complicate the behavior. Still, I can see how someone might want to change the batch size at different points in a session.
[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration
[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726824#comment-16726824 ]

Li Jin commented on SPARK-26410:

One thing we want to think about is whether or not to mix batches of different sizes inside a single Eval node. That feels a bit complicated; it is probably simpler to group UDFs of the same batch size into one Eval node, at some performance cost. I am not very certain about this and am curious what other people think.
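A toy sketch of the simpler option described above: group UDFs by their requested batch size so each group gets its own evaluation pass. The (udf, batch size) pairs are made up and this is not how Spark's planner is structured; it only illustrates the grouping idea.

{code}
from itertools import groupby

# Hypothetical (udf_name, requested_batch_size) pairs from one projection.
udfs = [("predict1", 100), ("predict2", 120), ("predict3", 100)]

# One eval group per distinct batch size, at the cost of an extra pass over
# the data for each extra group.
by_size = sorted(udfs, key=lambda u: u[1])
eval_groups = [(size, [name for name, _ in group])
               for size, group in groupby(by_size, key=lambda u: u[1])]
print(eval_groups)  # [(100, ['predict1', 'predict3']), (120, ['predict2'])]
{code}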
[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration
[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726820#comment-16726820 ]

Li Jin commented on SPARK-26410:

Thanks for the explanation. I think it makes sense to have the batch size as one of the parameters of a scalar Pandas UDF. I also remember seeing cases where the default batch size was too large for a particular column that happened to be a large array column. That sounds similar to your image example.
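Back-of-the-envelope illustration of the large-array-column case, with made-up numbers; the 10,000-record default below is the default of the existing spark.sql.execution.arrow.maxRecordsPerBatch conf.

{code}
# Estimate per-record bytes for a wide array column to see why the default
# records-per-batch limit can produce very large Arrow batches.
embedding_dim = 4096                    # e.g. a float64 vector per row (assumed)
bytes_per_record = embedding_dim * 8    # ~32 KB of array data per record

default_records_per_batch = 10000
print(bytes_per_record * default_records_per_batch / 1e9)  # ~0.33 GB per batch

# A per-UDF limit (or a buffer-size limit, per SPARK-23258) would let this UDF
# ask for far fewer records per batch without touching the session-wide conf.
target_batch_bytes = 64 << 20           # 64 MB budget (assumed)
print(max(1, target_batch_bytes // bytes_per_record))       # ~2048 records
{code}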
[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration
[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726155#comment-16726155 ]

Xiangrui Meng commented on SPARK-26410:

On the same cluster, there can be very different workloads. For example, doing model inference over image data is very different from computing summary statistics. Each image record is roughly 100 KB to 10 MB, so the batch size doesn't need to be very large to benefit from vectorized computation.
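A rough sketch of that image-inference case, assuming a binary image column that Arrow can convert and using a trivial placeholder in place of real model scoring; today only the session-wide conf can shrink the batch for such a UDF.

{code}
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# With ~100 KB-10 MB per image, even a few hundred records per Arrow batch can
# be hundreds of MB, so this workload wants a much smaller batch than a
# summary-statistics job on the same cluster. The only knob today:
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "32")

@pandas_udf("double")
def predict(image_bytes: pd.Series) -> pd.Series:
    # Placeholder scoring; a real UDF would run model inference on the bytes.
    return image_bytes.apply(lambda b: float(len(b)))

# scored = df.select(predict(col("image")).alias("score"))
{code}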
[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration
[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725948#comment-16725948 ]

Li Jin commented on SPARK-26410:

I am curious why a user would want to configure maxRecordsPerBatch. As far as I remember, it is just a performance/memory-usage conf. In your example, why does the user want 100 rows per batch for one UDF and 120 rows per batch for another?
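For context, a minimal sketch of the only batch-size control that exists today, assuming an active spark session; it applies to every Pandas UDF in the session, so the 100/120 split from the description has no equivalent.

{code}
# Existing session-wide setting; every Pandas UDF in the session sees it.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "100")

# There is no way to say "predict1 gets 100 rows per batch, predict2 gets 120"
# for UDFs evaluated in the same query, which is what this ticket asks for.
{code}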