[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853745#comment-16853745 ]

Karthik Palaniappan commented on SPARK-24815:
---------------------------------------------

Just to clarify: does Spark allow having multiple tasks per Kafka partition? 
The Kafka integration guide implies that they are 1:1: 
[https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html]. 
Batch dynamic allocation gives you enough executors to process all tasks in 
the running stage in parallel, so I assume that if you end up with too much 
data per partition, your only recourse is to increase the number of Kafka 
partitions.
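
For concreteness, here's roughly the kind of job I have in mind (a minimal 
sketch; the broker and topic names are placeholders, and the 1:1 mapping is 
just my reading of the docs):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-ss-sketch").getOrCreate()

// As I read the docs, each Kafka partition becomes one Spark input partition.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "events")                    // placeholder topic
  .load()

// If one partition carries too much data, repartitioning after the source is
// the only in-Spark workaround I can think of; otherwise you have to add
// Kafka partitions.
val balanced = stream.repartition(64)
{code}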

Also, does Spark not handle state rebalancing today? In other words, is 
Structured Streaming already fault-tolerant to node failures?

Do you have any good references (JIRAs, design docs) on how Spark stores 
streaming state internally? I see plenty of blogs and articles on how to do 
stateful processing, but not on how it works under the hood.

> Structured Streaming should support dynamic allocation
> ------------------------------------------------------
>
>                 Key: SPARK-24815
>                 URL: https://issues.apache.org/jira/browse/SPARK-24815
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Structured Streaming
>    Affects Versions: 2.3.1
>            Reporter: Karthik Palaniappan
>            Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in: it requests 
> more executors if the task backlog exceeds a certain size, and removes 
> executors if they are idle for a certain period of time.
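>
> For reference, these are the knobs that drive the batch algorithm today 
> (values here are just illustrative):
> {code}
> # Request more executors once tasks have been backlogged this long:
> spark.dynamicAllocation.schedulerBacklogTimeout 1s
> # Remove an executor after it has sat idle this long:
> spark.dynamicAllocation.executorIdleTimeout 60s
> spark.dynamicAllocation.minExecutors 1
> spark.dynamicAllocation.maxExecutors 50
> # External shuffle service, required so executors can be removed safely:
> spark.shuffle.service.enabled true
> {code}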
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA; see the sketch below).
> 2) We should build a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job while 
> Core's dynamic allocation is enabled.
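>
> For (1) and (2), a hypothetical sketch of what a pluggable policy could look 
> like; none of these names exist in Spark today:
> {code:scala}
> // Hypothetical interface (made-up name): SparkContext would consult this
> // instead of hardcoding the batch heuristics in ExecutorAllocationManager.
> trait ExecutorAllocationPolicy {
>   /** Desired total number of executors, given current cluster state. */
>   def targetExecutors(current: Int, backloggedTasks: Int, idleExecutors: Int): Int
> }
>
> // A streaming-flavored policy: scale on whether micro-batches keep up with
> // the trigger interval, rather than on raw task backlog.
> class MicroBatchPolicy(triggerIntervalMs: Long) extends ExecutorAllocationPolicy {
>   @volatile var lastBatchDurationMs: Long = 0L  // updated from query progress events
>
>   override def targetExecutors(current: Int, backloggedTasks: Int, idleExecutors: Int): Int = {
>     if (lastBatchDurationMs > triggerIntervalMs) {
>       current + 1  // batches are falling behind the trigger: scale up
>     } else if (lastBatchDurationMs < triggerIntervalMs / 2) {
>       math.max(1, current - 1)  // ample headroom: scale down
>     } else {
>       current
>     }
>   }
> }
> {code}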


