[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853955#comment-16853955 ]
Stavros Kontopoulos edited comment on SPARK-24815 at 6/2/19 12:07 PM:
----------------------------------------------------------------------
Yes, it's 1-1. What I am trying to say is that if you have at most N Spark tasks/partitions processed in parallel and you want to do dynamic re-partitioning on the Spark side only (_repartitioning_ on the Kafka side is an _expensive_ operation, as far as I know) because these N tasks got "fatter", then you need the processing_time/batch_duration ratio metric: your backlog task count is zero anyway, so otherwise you have no visibility into what is happening. Once the re-partitioning is done on the Spark side you can fall back to the backlog metric (since you will now have tasks queued), but you could also just keep using the processing_time/batch_duration ratio. The alternative is not to have a 1-1 relationship in the first place, as with the other sources; then the backlog metric seems good enough, as long as the processing time of the offsets is proportional to their number.

Regarding references, there was a discussion last year: [http://apache-spark-developers-list.1001551.n3.nabble.com/Discussion-Clarification-regarding-Stateful-Aggregations-over-Structured-Streaming-td25941.html|http://apache-spark-developers-list.1001551.n3.nabble.com/Discussion-Clarification-regarding-Stateful-Aggregations-over-Structured-Streaming-td25941.html#a25942]

There is also a project that shows how the APIs can be extended: [https://github.com/chermenin/spark-states/tree/master/src/main/scala/ru/chermenin/spark/sql/execution/streaming/state]. In general the source of truth is the code, as you know. The related JIRA for the state backend is SPARK-13809: https://issues.apache.org/jira/browse/SPARK-13809 and the related doc is here: [https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254]
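To make the ratio idea concrete, a minimal sketch (not part of the ticket) of how such a metric could be derived from StreamingQueryProgress via a StreamingQueryListener; the triggerIntervalMs value, the 0.9 threshold, and the scale-up hook are placeholder assumptions:

{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Hypothetical trigger interval for this sketch; in practice it would mirror
// the query's Trigger.ProcessingTime setting.
val triggerIntervalMs: Long = 10000L

// Sketch of the processing_time / batch_duration ratio: with a 1-1 Kafka
// partition to Spark task mapping there is never a task backlog, so the only
// signal that the N tasks got "fatter" is how close the micro-batch
// processing time gets to the trigger interval.
class ProcessingRatioListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // "triggerExecution" is the wall-clock time spent on the whole micro-batch.
    val processingMs = Option(event.progress.durationMs.get("triggerExecution"))
      .map(_.longValue()).getOrElse(0L)
    val ratio = processingMs.toDouble / triggerIntervalMs

    if (ratio > 0.9) {
      // Placeholder hook: signal whatever component does the Spark-side
      // re-partitioning / executor scaling that the query is close to falling behind.
      println(s"batch ${event.progress.batchId}: ratio $ratio, consider scaling up")
    }
  }
}

// Register it on the session driving the streaming query:
// spark.streams.addListener(new ProcessingRatioListener)
{code}

Once the extra parallelism exists and tasks actually queue up, the usual backlog-based signal becomes meaningful again, as described above.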
> Structured Streaming should support dynamic allocation
> ------------------------------------------------------
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Structured Streaming
> Affects Versions: 2.3.1
> Reporter: Karthik Palaniappan
> Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing containers to match the actual workload. On multi-tenant clusters, it ensures that a Spark job is taking no more resources than necessary. In cloud environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured streaming job, the batch dynamic allocation algorithm kicks in. It requests more executors if the task backlog is a certain size, and removes executors if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a particular implementation in SparkContext.scala (this should be a separate JIRA).
> 2) We should make a structured streaming algorithm that's separate from the batch algorithm. Eventually, continuous processing might need its own algorithm.
> 3) Spark should print a warning if you run a structured streaming job when Core's dynamic allocation is enabled
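For context, this is roughly what the Core (batch) dynamic allocation setup that a Structured Streaming job currently picks up looks like; a minimal sketch with illustrative values only (the timeouts and executor bounds are not from this ticket):

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative values only: these are the Core (batch) dynamic allocation
// settings that a Structured Streaming job currently inherits, since there is
// no streaming-specific algorithm yet.
val spark = SparkSession.builder()
  .appName("structured-streaming-with-batch-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")                  // required by dynamic allocation on 2.x
  .config("spark.dynamicAllocation.minExecutors", "2")              // placeholder
  .config("spark.dynamicAllocation.maxExecutors", "20")             // placeholder
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")  // request executors when tasks queue up
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")     // release executors that sit idle
  .getOrCreate()
{code}

As the ticket notes, a streaming job with a 1-1 partition-to-task mapping rarely accumulates the task backlog this algorithm watches for, which is why a separate streaming-aware policy (or the ratio metric discussed above) is being considered.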