[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853955#comment-16853955 ]

Stavros Kontopoulos commented on SPARK-24815:
---------------------------------------------

Yes, it's 1-1. What I am trying to say is: if you have N Spark 
tasks/partitions processed in parallel and you want to do dynamic 
re-partitioning (on the Spark side) because these N tasks got "fatter", you 
need the processing/batch_duration ratio metric, since your backlog task count 
is zero anyway and otherwise you have no idea what is happening (unless you 
count the number of offsets per partition). Once re-partitioning is done on 
the Spark side you can fall back to the backlog metric (since now you will 
have tasks queued), but there is no reason to; you could just keep the 
processing/batch_duration ratio. 
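
To make that concrete, a rough sketch of the metric as a query listener (the 
listener name and the 0.9 threshold are mine, purely illustrative; 
"triggerExecution" is Spark's wall-clock duration for the whole micro-batch):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Watches the processing/batch_duration ratio for every micro-batch.
class ProcessingRatioListener(triggerIntervalMs: Long) extends StreamingQueryListener {

  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // durationMs is a java.util.Map[String, java.lang.Long]; the key may be
    // absent for some progress events, hence the null check.
    val processingMs = event.progress.durationMs.get("triggerExecution")
    if (processingMs != null) {
      val ratio = processingMs.doubleValue / triggerIntervalMs
      // Near (or above) 1.0 the tasks got "fatter" and it is time to
      // re-partition, even though the task backlog is still zero.
      if (ratio > 0.9) {
        println(s"batch ${event.progress.batchId}: processing/batch_duration = $ratio")
      }
    }
  }
}

// spark.streams.addListener(new ProcessingRatioListener(triggerIntervalMs = 10000L))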

Regarding references, there was a discussion last year: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Discussion-Clarification-regarding-Stateful-Aggregations-over-Structured-Streaming-td25941.html

There is this project that shows how the APIs can be extended: 
https://github.com/chermenin/spark-states/tree/master/src/main/scala/ru/chermenin/spark/sql/execution/streaming/state
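
Plugging such a backend in is just a config switch; a minimal sketch, assuming 
the RocksDbStateStoreProvider class name from the repo layout above (the exact 
name may differ; the config key is Spark's, available since 2.3):

import org.apache.spark.sql.SparkSession

// spark.sql.streaming.stateStore.providerClass selects the state store
// implementation used by stateful structured streaming operators.
val spark = SparkSession.builder()
  .appName("custom-state-backend")
  .config("spark.sql.streaming.stateStore.providerClass",
    "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
  .getOrCreate()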

In general, the source of truth is the code, as you know. The related JIRA for 
the state backend is here: https://issues.apache.org/jira/browse/SPARK-13809 
and the related doc is here: 
https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254

> Structured Streaming should support dynamic allocation
> ------------------------------------------------------
>
>                 Key: SPARK-24815
>                 URL: https://issues.apache.org/jira/browse/SPARK-24815
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Structured Streaming
>    Affects Versions: 2.3.1
>            Reporter: Karthik Palaniappan
>            Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled.
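
For reference, the batch algorithm described in the quoted text is driven by 
settings like the following (the min/max values are illustrative), and this is 
exactly what kicks in today for a streaming query:

import org.apache.spark.sql.SparkSession

// Batch-oriented dynamic allocation: grow on sustained task backlog, shrink
// on idle executors. Neither signal fits a micro-batch streaming workload.
val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")  // required by dynamic allocation
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")  // grow trigger
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")     // shrink trigger
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .getOrCreate()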