Hello Everyone
I am trying to understand the internals of Spark Streaming (not Structured
Streaming), specifically the way tasks see the DStream. I am going over the
Spark source code in Scala, at <https://github.com/apache/spark>. I
understand the call stack to be:

CoarseGrainedExecutorBackend (main) -> Executor.launchTask() ->
TaskRunner (Runnable).run() -> task.run(...)
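
To make sure I am reading this right, here is a rough paraphrase (my own
simplification for this mail, not the actual Spark source; names and details
trimmed) of the DStream structure I am referring to below, i.e. RDDs kept in
a map keyed by batch time:

    import scala.collection.mutable.HashMap
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time

    // Rough, simplified paraphrase of what I see in
    // org.apache.spark.streaming.dstream.DStream: each DStream keeps the
    // RDDs it has generated, keyed by the batch time they belong to.
    abstract class SimplifiedDStream[T] {

      // one RDD per batch time (what I mean by "hashmap of RDDs")
      private val generatedRDDs = new HashMap[Time, RDD[T]]()

      // subclasses (e.g. the Kafka input streams) know how to build the
      // RDD for a given batch time
      def compute(validTime: Time): Option[RDD[T]]

      // look up the RDD for a batch time, computing and caching it if needed
      def getOrCompute(time: Time): Option[RDD[T]] = {
        generatedRDDs.get(time).orElse {
          val rddOption = compute(time)
          rddOption.foreach(rdd => generatedRDDs.put(time, rdd))
          rddOption
        }
      }
    }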

I understand that a DStream is essentially a map of RDDs keyed by batch time
(as sketched above), but I am trying to understand the way tasks see the
DStream. I know that there are basically two approaches to Kafka-Spark
integration:

   - *Receiver-based, using the high-level Kafka consumer API*

   Here a new (micro-)batch is created at every batch interval (say 5
   seconds) with, say, 5 partitions (i.e. a 1-second block interval) by the
   *Receiver* task and handed downstream to *regular* tasks. (A minimal
   sketch of this setup follows after this list.)

   *Question:* In this example, where a microbatch is created every 5
   seconds, always has exactly 5 partitions, and the partitions of every
   microbatch are DAG-ged downstream in exactly the same way, is the same
   *regular* task re-used over and over again for the same partition id of
   every microbatch (RDD), as a long-running task? For example:

   If *ubatch1* with partitions *(P1, P2, P3, P4, P5)* at time *T0* is
   assigned to task ids *(T1, T2, T3, T4, T5)*, will *ubatch2* with
   partitions *(P1', P2', P3', P4', P5')* at time *T5* also be assigned to
   the same set of tasks *(T1, T2, T3, T4, T5)*, or will new tasks *(T6,
   T7, T8, T9, T10)* be created for *ubatch2*?

   If the latter is the case, wouldn't it be expensive to serialize and
   send new tasks over the network to the executors every 5 seconds, when
   you already know that there are tasks doing exactly the same thing that
   could be re-used as long-running tasks?
   - *Direct, using the low-level Kafka consumer API*

   Here a Kafka partition maps to a Spark partition, and therefore to a
   task. Again, considering 5 Kafka partitions for a topic *t*, we get 5
   Spark partitions and their corresponding tasks. (A minimal sketch of
   this setup also follows after this list.)

   *Question:* Say *ubatch1* at *T0* has partitions *(P1, P2, P3, P4, P5)*
   assigned to tasks *(T1, T2, T3, T4, T5)*. Will *ubatch2* with partitions
   *(P1', P2', P3', P4', P5')* at time *T5* also be assigned to the same
   set of tasks *(T1, T2, T3, T4, T5)*, or will new tasks *(T6, T7, T8, T9,
   T10)* be created for *ubatch2*?
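
For reference, this is roughly how I am creating the receiver-based stream
in the first case (a minimal sketch; the ZooKeeper quorum, group id, and
topic name are placeholders standing in for my setup). The 5 partitions per
microbatch come from combining the 5-second batch interval with a 1-second
spark.streaming.blockInterval:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils  // spark-streaming-kafka-0-8

    object ReceiverBasedExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("receiver-based-example")
          // 5 s batch interval / 1 s block interval => 5 blocks,
          // i.e. 5 partitions per microbatch RDD
          .set("spark.streaming.blockInterval", "1s")

        val ssc = new StreamingContext(conf, Seconds(5))

        // Receiver-based stream (high-level consumer API);
        // ZooKeeper quorum, group id and topic are placeholders
        val stream = KafkaUtils.createStream(
          ssc,
          "zk-host:2181",
          "my-consumer-group",
          Map("t" -> 1)  // topic -> number of receiver threads
        )

        stream.foreachRDD { rdd =>
          // each microbatch RDD should have 5 partitions in the scenario above
          println(s"partitions in this microbatch: ${rdd.getNumPartitions}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }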

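And this is roughly how I am creating the direct stream in the second case
(again a minimal sketch with a placeholder broker list, using the
spark-streaming-kafka-0-8 direct API); here the 5 Kafka partitions of topic
*t* map 1:1 to 5 Spark partitions:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils  // spark-streaming-kafka-0-8

    object DirectStreamExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("direct-stream-example")
        val ssc = new StreamingContext(conf, Seconds(5))

        // Direct stream (low-level / simple consumer API);
        // the broker list is a placeholder
        val kafkaParams = Map("metadata.broker.list" -> "broker-1:9092,broker-2:9092")

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc,
          kafkaParams,
          Set("t")  // topic t, which has 5 Kafka partitions in my example
        )

        stream.foreachRDD { rdd =>
          // 1:1 mapping: 5 Kafka partitions => 5 Spark partitions => 5 tasks per microbatch
          println(s"partitions in this microbatch: ${rdd.getNumPartitions}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }
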

I have also put up this question on Stack Overflow:
https://stackoverflow.com/questions/56102094/kafka-spark-streaming-integration-relation-between-tasks-and-dstreams

Regards
Sheel
