[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853955#comment-16853955 ]
Stavros Kontopoulos edited comment on SPARK-24815 at 6/2/19 12:07 PM:
----------------------------------------------------------------------
Yes, it's 1-1. What I am trying to say is that if you have at most N Spark tasks/partitions processed in parallel and you want to do dynamic re-partitioning on the Spark side only (_repartitioning_ on the Kafka side is an _expensive_ operation, as far as I know) because these N tasks got "fatter", then you need the processing_time/batch_duration ratio metric: your backlog task count is zero anyway, so otherwise you have no visibility into what is happening. Once the re-partitioning is done on the Spark side you can fall back to the backlog metric (since you will now have tasks queued), but you could also just keep using the processing_time/batch_duration ratio. The alternative is not to have a 1-1 relationship in the first place, as with the other sources; then the backlog metric seems good enough, as long as the processing time of the offsets is proportional to their number.

Regarding references, there was a discussion last year: [http://apache-spark-developers-list.1001551.n3.nabble.com/Discussion-Clarification-regarding-Stateful-Aggregations-over-Structured-Streaming-td25941.html|http://apache-spark-developers-list.1001551.n3.nabble.com/Discussion-Clarification-regarding-Stateful-Aggregations-over-Structured-Streaming-td25941.html#a25942]

There is also a project that shows how the APIs can be extended: [https://github.com/chermenin/spark-states/tree/master/src/main/scala/ru/chermenin/spark/sql/execution/streaming/state]. In general the source of truth is the code, as you know. The related JIRA for the state backend is SPARK-13809: https://issues.apache.org/jira/browse/SPARK-13809 and the related doc is here: [https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254]
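To make the ratio idea concrete, a minimal sketch (not part of the ticket) of how such a metric could be derived from StreamingQueryProgress via a StreamingQueryListener; the triggerIntervalMs value, the 0.9 threshold, and the scale-up hook are placeholder assumptions:

{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Hypothetical trigger interval for this sketch; in practice it would mirror
// the query's Trigger.ProcessingTime setting.
val triggerIntervalMs: Long = 10000L

// Sketch of the processing_time / batch_duration ratio: with a 1-1 Kafka
// partition to Spark task mapping there is never a task backlog, so the only
// signal that the N tasks got "fatter" is how close the micro-batch
// processing time gets to the trigger interval.
class ProcessingRatioListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // "triggerExecution" is the wall-clock time spent on the whole micro-batch.
    val processingMs = Option(event.progress.durationMs.get("triggerExecution"))
      .map(_.longValue()).getOrElse(0L)
    val ratio = processingMs.toDouble / triggerIntervalMs

    if (ratio > 0.9) {
      // Placeholder hook: signal whatever component does the Spark-side
      // re-partitioning / executor scaling that the query is close to falling behind.
      println(s"batch ${event.progress.batchId}: ratio $ratio, consider scaling up")
    }
  }
}

// Register it on the session driving the streaming query:
// spark.streams.addListener(new ProcessingRatioListener)
{code}

Once the extra parallelism exists and tasks actually queue up, the usual backlog-based signal becomes meaningful again, as described above.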
> Structured Streaming should support dynamic allocation
> ------------------------------------------------------
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Structured Streaming
> Affects Versions: 2.3.1
> Reporter: Karthik Palaniappan
> Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing containers to match the actual workload. On multi-tenant clusters, it ensures that a Spark job is taking no more resources than necessary. In cloud environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured streaming job, the batch dynamic allocation algorithm kicks in. It requests more executors if the task backlog is a certain size, and removes executors if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a particular implementation in SparkContext.scala (this should be a separate JIRA).
> 2) We should make a structured streaming algorithm that's separate from the batch algorithm. Eventually, continuous processing might need its own algorithm.
> 3) Spark should print a warning if you run a structured streaming job when Core's dynamic allocation is enabled
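For context, this is roughly what the Core (batch) dynamic allocation setup that a Structured Streaming job currently picks up looks like; a minimal sketch with illustrative values only (the timeouts and executor bounds are not from this ticket):

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative values only: these are the Core (batch) dynamic allocation
// settings that a Structured Streaming job currently inherits, since there is
// no streaming-specific algorithm yet.
val spark = SparkSession.builder()
  .appName("structured-streaming-with-batch-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")                  // required by dynamic allocation on 2.x
  .config("spark.dynamicAllocation.minExecutors", "2")              // placeholder
  .config("spark.dynamicAllocation.maxExecutors", "20")             // placeholder
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")  // request executors when tasks queue up
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")     // release executors that sit idle
  .getOrCreate()
{code}

As the ticket notes, a streaming job with a 1-1 partition-to-task mapping rarely accumulates the task backlog this algorithm watches for, which is why a separate streaming-aware policy (or the ratio metric discussed above) is being considered.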