Hello everyone, I am trying to understand the internals of Spark Streaming (not Structured Streaming), specifically the way tasks see a DStream. I am going over the Spark source code in Scala, here: <https://github.com/apache/spark>. I understand the call stack:
`CoarseGrainedExecutorBackend` (main) -> `Executor.launchTask` -> `TaskRunner` (a `Runnable`) `.run()` -> `Task.run(...)`

I understand that a DStream is essentially a map from batch time to RDD (`generatedRDDs`, a `HashMap[Time, RDD[T]]`), but I am trying to understand the way tasks see the DStream. I know that there are basically two approaches to Kafka-Spark integration:

- *Receiver-based, using the high-level Kafka consumer API*

  Here a new (micro-)batch is created at every batch interval (say 5 secs), with say 5 partitions (=> 1 sec block interval), by the *Receiver* task and handed downstream to *regular* tasks.

  *Question:* Considering our example, where a micro-batch is created every 5 secs, has exactly 5 partitions, and all the partitions of all the micro-batches are supposed to be DAG-ged downstream in exactly the same way: is the same *regular* task reused over and over again for the same partition id of every micro-batch (RDD), as a long-running task? E.g., if *ubatch1* with partitions *(P1, P2, P3, P4, P5)* at time *T0* is assigned to task ids *(T1, T2, T3, T4, T5)*, will *ubatch2* with partitions *(P1', P2', P3', P4', P5')* at time *T5* also be assigned to the same set of tasks *(T1, T2, T3, T4, T5)*, or will new tasks *(T6, T7, T8, T9, T10)* be created for *ubatch2*? If the latter is the case, wouldn't it be expensive to ship new tasks over the network to the executors every 5 seconds, when you already know that there are tasks doing exactly the same thing that could be reused as long-running tasks?

- *Direct, using the low-level Kafka consumer API*

  Here a Kafka partition maps to a Spark partition, and therefore to a task. Again, considering 5 Kafka partitions for a topic *t*, we get 5 Spark partitions and their corresponding tasks.

  *Question:* Say *ubatch1* at *T0* has partitions *(P1, P2, P3, P4, P5)* assigned to tasks *(T1, T2, T3, T4, T5)*. Will *ubatch2* with partitions *(P1', P2', P3', P4', P5')* at time *T5* also be assigned to the same set of tasks *(T1, T2, T3, T4, T5)*, or will new tasks *(T6, T7, T8, T9, T10)* be created for *ubatch2*?
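To make the "DStream is a map of RDDs" mental model concrete, here is a minimal self-contained Scala sketch with no Spark dependency. `Time`, `FakeRDD`, and `SketchDStream` are simplified stand-ins I made up for illustration; the real field this mirrors is `DStream.generatedRDDs`, a `HashMap[Time, RDD[T]]`, where each batch interval gets its own freshly generated RDD:

```scala
import scala.collection.mutable

// Stand-ins for Spark's real Time and RDD classes (hypothetical,
// for illustration only).
final case class Time(millis: Long)
final case class FakeRDD(id: Int, numPartitions: Int)

// A DStream conceptually keeps a map from batch time to the RDD
// generated for that batch, mirroring DStream.generatedRDDs.
class SketchDStream(numPartitions: Int) {
  private val generatedRDDs = mutable.HashMap.empty[Time, FakeRDD]
  private var nextRddId = 0

  // Return the RDD for batch time t, generating and caching it on
  // first request -- so each batch interval yields a distinct RDD,
  // while repeated lookups for the same time return the same one.
  def getOrCompute(t: Time): FakeRDD =
    generatedRDDs.getOrElseUpdate(t, {
      val rdd = FakeRDD(nextRddId, numPartitions)
      nextRddId += 1
      rdd
    })
}
```

For example, with a 5-second batch interval and 5 partitions, `getOrCompute(Time(0L))` and `getOrCompute(Time(5000L))` return two distinct 5-partition RDDs (the ubatch1/ubatch2 of the question), which is why the task-reuse question arises: each batch's RDD is a new object, whatever the scheduler then does with its tasks.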
I have put up this question on SO at https://stackoverflow.com/questions/56102094/kafka-spark-streaming-integration-relation-between-tasks-and-dstreams .

Regards,
Sheel