Note: none of this applies to direct streaming approaches, only receiver-based DStreams.
You can think of a receiver as a long-running task that never finishes. Each receiver is submitted to an executor slot somewhere, runs indefinitely, and internally has a method which passes records over to a block management system. A timing parameter you set decides when each block is "done"; records arriving after that time has passed go into the next block (see spark.streaming.blockInterval at <https://spark.apache.org/docs/latest/configuration.html#spark-streaming>). Once a block is done it can be processed in the next Spark batch. The gap between a block starting and a block being finished is why you can lose data in receiver streaming without Write Ahead Logging.

Usually your block interval divides evenly into your batch interval, so you'll get X blocks per batch. Each block becomes one partition of the job being done in a streaming batch. Multiple receivers can be unified into a single DStream, which just means the blocks produced by all of those receivers are handled in the same streaming batch. So if you have 5 different receivers, you need at minimum 6 executor cores: 1 core for each receiver, and 1 core to actually do your processing work. In a real-world case you probably want significantly more cores on the processing side than just 1. Without repartitioning you will never have more partitions per batch than (batch interval / block interval) x number of receivers.

A quick example: I run 5 receivers with a block interval of 100 ms and a Spark batch interval of 1 second, and I use union to group them all together. I will most likely end up with one Spark job for each batch every second, running with 50 partitions (1000 ms / 100 ms per block x 5 receivers). If I have a total of 10 cores in the system, 5 of them are running receivers, and the remaining 5 must process the 50 partitions of data generated by the last second of work.

And again, just to reiterate: if you are doing a direct streaming approach or structured streaming, none of this applies.
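The partition and core arithmetic above can be sketched as a couple of small helpers. These function names are illustrative only, not part of any Spark API; the numbers are the ones from the example (5 receivers, 100 ms blocks, 1 s batches):

```python
# Illustrative helpers (not Spark APIs) for the receiver-based DStream
# sizing arithmetic described above.

def partitions_per_batch(batch_interval_ms, block_interval_ms, num_receivers):
    """Each finished block becomes one partition, so a batch holds
    (batch interval / block interval) blocks per receiver."""
    return (batch_interval_ms // block_interval_ms) * num_receivers

def min_executor_cores(num_receivers, processing_cores=1):
    """Each receiver permanently occupies one executor core, and at
    least one more core is needed to run the actual batch processing."""
    return num_receivers + processing_cores

# The worked example: 5 receivers, 100 ms block interval, 1 s batch interval.
print(partitions_per_batch(1000, 100, 5))  # -> 50 partitions per batch
print(min_executor_cores(5))               # -> 6 cores at minimum

# With 10 total cores, 5 run receivers, leaving 5 to process
# the 50 partitions produced each second.
print(10 - min_executor_cores(5, 0))       # -> 5 processing cores
```

The same helpers make it easy to see why more processing cores are usually wanted: 5 cores chewing through 50 partitions means each core handles 10 partitions serially per batch.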
On Sat, Aug 8, 2020 at 10:03 AM Dark Crusader <relinquisheddra...@gmail.com> wrote:
> Hi,
>
> I'm having some trouble figuring out how receivers tie into spark
> driver-executor structure.
> Do all executors have a receiver that is blocked as soon as it
> receives some stream data?
> Or can multiple streams of data be taken as input into a single executor?
>
> I have stream data coming in at every second coming from 5 different
> sources. I want to aggregate data from each of them. Does this mean I need
> 5 executors or does it have to do with threads on the executor?
>
> I might be mixing in a few concepts here. Any help would be appreciated.
> Thank you.
>