Hi everyone,

Sorry about the noob question, but I am struggling to understand the ways to
create DStreams in Spark. Here is my understanding based on what I could
gather from the documentation and from reading Spark's code (plus some
guesswork). Please correct me if I am wrong.

1. In most cases, one would extend either ReceiverInputDStream
or InputDStream to create a custom DStream that pulls data from an external
system (I've tried to sketch both approaches right after this list).
 - ReceiverInputDStream is used to distribute the data-receiving code (i.e.
a Receiver) to workers. N instances of ReceiverInputDStream result in
receivers running on N workers. There is no control over which worker node
executes which instance of the receiving code.
 - InputDStream is used to run the receiving code on the driver. The driver
creates RDDs, which are distributed to the worker nodes that run the
processing logic. There is no way to control how an RDD gets distributed to
workers unless one repartitions the generated RDDs.
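
For concreteness, here is my attempt at a minimal sketch of the
receiver-based approach, assuming a plain socket source (the class name,
host, and port are just placeholders). As far as I can tell, Spark ships
the Receiver instance to a worker and calls onStart() there:

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receiving must happen on its own thread so onStart() returns quickly.
    new Thread("Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: the receiving thread checks isStopped().
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line) // hand each record to Spark for buffering/replication
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Source closed the connection, reconnecting")
    } catch {
      case e: Throwable => restart("Error receiving data", e)
    }
  }
}

This would be wired up as ssc.receiverStream(new LineReceiver("localhost",
9999)), and my understanding is that creating N such streams (combined with
ssc.union) is what spreads receivers across N workers.

And a sketch of the driver-side approach, where compute() runs on the
driver once per batch interval (fetchBatch is a hypothetical stand-in for
whatever client call pulls one batch from the external system):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

class PollingInputDStream(ssc_ : StreamingContext, fetchBatch: () => Seq[String])
    extends InputDStream[String](ssc_) {

  override def start(): Unit = {} // open connections to the source here
  override def stop(): Unit = {}  // and close them here

  // Runs on the driver; the returned RDD's partitions are what the
  // workers actually process.
  override def compute(validTime: Time): Option[RDD[String]] = {
    val records = fetchBatch()
    if (records.isEmpty) None
    else Some(ssc_.sparkContext.parallelize(records))
  }
}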

2. DStreams and RDDs get no feedback on whether the processing succeeded.
This means one can't implement a re-pull from the source in case of
failures.
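
The closest thing to a feedback loop I can find is the "reliable receiver"
pattern: the ArrayBuffer-based store() blocks until Spark has replicated
the block, so one can acknowledge the source only after it returns and rely
on the source redelivering anything unacknowledged. Is something like this
sketch the intended way? (QueueClient and its pull/ack methods are made up
for illustration):

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical client for a source with pull/ack semantics, e.g. a queue
// that redelivers any batch that was never acknowledged.
trait QueueClient {
  def pull(): (Seq[String], Long) // a batch of messages plus a delivery tag
  def ack(tag: Long): Unit        // the source may now discard that batch
}

class ReliableQueueReceiver(newClient: () => QueueClient)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("Reliable Queue Receiver") {
      override def run(): Unit = receive(newClient())
    }.start()
  }

  def onStop(): Unit = {}

  private def receive(client: QueueClient): Unit = {
    try {
      while (!isStopped) {
        val (messages, tag) = client.pull()
        // This store() variant blocks until the block is replicated
        // inside Spark, so acking only after it returns means a crashed
        // receiver leaves the batch unacked and it gets redelivered.
        store(ArrayBuffer(messages: _*))
        client.ack(tag)
      }
    } catch {
      case e: Throwable => restart("Error receiving data", e)
    }
  }
}

If the receiver dies between pull() and ack(), the batch is redelivered,
which looks like at-least-once for ingestion at least, though it still says
nothing about whether downstream processing of the RDDs succeeded.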

The above makes me realize that it is not trivial to implement a streaming
use case with an at-least-once processing guarantee. Any thoughts on
building a reliable real-time processing system using Spark would be
appreciated.
