[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2019-01-25 Thread Xiangrui Meng (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752467#comment-16752467 ]

Xiangrui Meng commented on SPARK-26410:
---

There are several possible solutions to this. SPARK-23258 is one. I think it is 
more reasonable to limit the buffer size instead of the number of records per 
batch, because the right record count varies from task to task.
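
For concreteness, a standalone sketch of why a byte budget adapts where a fixed 
row count does not. This is not Spark's Arrow serializer; the function and its 
name are purely illustrative:

{code}
import pandas as pd

def batches_by_bytes(pdf: pd.DataFrame, max_bytes: int):
    # Illustrative only -- not Spark's serializer. A byte budget yields
    # small batches for heavy rows and large batches for narrow rows,
    # with no per-task tuning of a row count.
    per_row = max(1, int(pdf.memory_usage(deep=True).sum()) // max(1, len(pdf)))
    step = max(1, max_bytes // per_row)
    for start in range(0, len(pdf), step):
        yield pdf.iloc[start:start + step]
{code}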

> Support per Pandas UDF configuration
> 
>
> Key: SPARK-26410
> URL: https://issues.apache.org/jira/browse/SPARK-26410
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the 
> "right" batch size usually depends on the task itself. It would be nice if 
> users could configure the batch size when they declare a Pandas UDF.
> This is orthogonal to SPARK-23258 (using max buffer size instead of row 
> count).
> Besides the API, we should also discuss how to merge Pandas UDFs with different 
> configurations. For example,
> {code}
> df.select(predict1(col("features")), predict2(col("features")))
> {code}
> where predict1 requests 100 rows per batch while predict2 requests 120 rows 
> per batch.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
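
To make the API question concrete, here is a hypothetical sketch of declaring the 
conf per UDF. The maxRecordsPerBatch keyword on pandas_udf does not exist in 
Spark; it only illustrates the proposal, and df is the DataFrame from the example 
above:

{code}
from pyspark.sql.functions import pandas_udf, col

# Hypothetical sketch: "maxRecordsPerBatch" is NOT a real pandas_udf
# parameter; it illustrates a per-UDF batch-size hint.
@pandas_udf("double", maxRecordsPerBatch=100)
def predict1(features):
    return features.apply(lambda v: float(sum(v)))

@pandas_udf("double", maxRecordsPerBatch=120)
def predict2(features):
    return features.apply(lambda v: float(max(v)))

# The planner would then need a merge rule when both UDFs share one
# batch pipeline, e.g. take the minimum: min(100, 120) = 100.
df.select(predict1(col("features")), predict2(col("features")))
{code}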






[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2019-01-24 Thread Bryan Cutler (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751771#comment-16751771 ]

Bryan Cutler commented on SPARK-26410:
--

This could be useful to have, but it does seem a little strange to bind batch 
size to a UDF. To me, batch size seems more related to the data being processed, 
and merging different batch sizes could complicate the behavior. Still, I can see 
how someone might want to change the batch size at different points in a session.
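
For reference, the existing session-wide knob can already be flipped between 
jobs. This assumes a SparkSession named spark and the predict UDFs from the 
example above; the conf key itself is real:

{code}
# Session-wide Arrow batch size for Pandas UDFs.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 100)
df.select(predict1(col("features"))).collect()

# Change it before the next job that wants a different batch size.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 120)
df.select(predict2(col("features"))).collect()
{code}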







[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2018-12-21 Thread Li Jin (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726824#comment-16726824 ]

Li Jin commented on SPARK-26410:


One thing we should think about is whether to mix batches of different sizes 
inside a single Eval node. That feels a bit complicated; it is probably simpler 
to group UDFs of the same batch size into one Eval node, at some performance 
cost. I am not very certain about this and am curious what other people think.
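
A sketch of the simpler grouping strategy, assuming each UDF carried a 
hypothetical batch_size attribute (Spark has no such field today):

{code}
from itertools import groupby

def plan_eval_nodes(udfs):
    # Sketch only: one Eval node per distinct batch size instead of
    # mixing sizes in one node. Simpler semantics, at the cost of an
    # extra evaluation pass per distinct size.
    by_size = sorted(udfs, key=lambda u: u.batch_size)  # hypothetical attribute
    return [list(g) for _, g in groupby(by_size, key=lambda u: u.batch_size)]
{code}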







[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2018-12-21 Thread Li Jin (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726820#comment-16726820 ]

Li Jin commented on SPARK-26410:


Thanks for the explanation. I think it makes sense to have batch size as one of 
the parameters of a scalar Pandas UDF.

I also remember seeing cases where the default batch size was too large for a 
particular column that happened to be a large array column. This sounds similar 
to your image example.
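
A back-of-the-envelope check with assumed numbers shows how that happens (the 
default for spark.sql.execution.arrow.maxRecordsPerBatch is 10000):

{code}
# Assumed: a column of float64 arrays with ~50,000 elements per row.
bytes_per_row = 50_000 * 8               # ~400 KB per row
batch_bytes = 10_000 * bytes_per_row     # default 10000-row batch
print(batch_bytes / 2**30)               # ~3.7 GiB per batch -- far too large
{code}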







[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2018-12-20 Thread Xiangrui Meng (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726155#comment-16726155 ]

Xiangrui Meng commented on SPARK-26410:
---

On the same cluster, there can be very different workloads. For example, doing 
model inference over image data is very different from computing summary 
statistics. Each image record is roughly 100 KB to 10 MB, and the batch size 
doesn't need to be very large to benefit from vectorized computation.
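
Illustrative arithmetic with assumed numbers: at ~1 MB per image, a few dozen 
rows per batch is typically enough for vectorized inference, while the default 
10000-row batch would be enormous:

{code}
image_bytes = 1 * 2**20                  # assume ~1 MB per image record
budget = 64 * 2**20                      # a modest 64 MB per-batch budget
rows_per_batch = budget // image_bytes   # 64 images per batch is plenty
default_batch = 10_000 * image_bytes     # ~9.8 GiB with the default conf
{code}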







[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2018-12-20 Thread Li Jin (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725948#comment-16725948 ]

Li Jin commented on SPARK-26410:


I am curious why a user would want to configure maxRecordsPerBatch. As far as I 
remember, this is just a performance/memory-usage conf. In your example, why 
would the user want 100 rows per batch for one UDF and 120 rows per batch for 
another?



