Hi everyone. Sorry about the noob question, but I am struggling to understand the ways to create DStreams in Spark. Here is my understanding, based on what I could gather from the documentation and from studying the Spark code (plus some hunches). Please correct me if I am wrong.
1. In most cases, one would extend either ReceiverInputDStream or InputDStream to create a custom DStream that pulls data from an external system.
   - ReceiverInputDStream is used to distribute data-receiving code (i.e. a Receiver) to workers. N instances of ReceiverInputDStream result in distribution to N workers, with no control over which worker node executes which instance of the receiving code (see the first sketch at the end of this post).
   - InputDStream is used to run the receiving code in the driver. The driver creates RDDs, which are distributed to worker nodes that run the processing logic. There is no way to control how an RDD gets distributed to workers unless one repartitions the generated RDDs (see the second sketch below).
2. DStreams and RDDs get no feedback on whether the processing was successful or not. This means one can't implement a re-pull in case of failure.

The above makes me realize that it is not trivial to implement a streaming use case with an at-least-once processing guarantee. Any thoughts on building a reliable real-time processing system using Spark would be appreciated.
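
To make my understanding of the receiver route concrete, here is a minimal sketch. MyReceiver and fetchRecord are hypothetical names; the point is just the onStart/onStop/store contract, and as far as I can tell the documented path is to extend Receiver and hand it to ssc.receiverStream rather than subclassing ReceiverInputDStream directly:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Minimal custom receiver sketch (hypothetical class/method names).
class MyReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // onStart() must return quickly, so do the pulling on a separate thread.
    new Thread("MyReceiver thread") {
      override def run(): Unit = {
        while (!isStopped()) {
          store(fetchRecord())  // hand each record over to Spark
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // close connections to the external system here
  }

  // Hypothetical stand-in for the actual pull from the external system.
  private def fetchRecord(): String = ???
}

// Usage (Spark, not the application, decides which worker runs each
// receiver instance):
// val stream = ssc.receiverStream(new MyReceiver("somehost", 9999))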
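
And this is the kind of workaround I had in mind for the repartitioning point, plus the checkpointing knobs I have found so far on the reliability side. Again just a sketch; the checkpoint path, partition count, and process() are made up, and MyReceiver is the hypothetical receiver from the first sketch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MyStreamingApp")
// Since Spark 1.2 a write-ahead log can apparently be enabled so that
// received data survives driver failure, which helps toward at-least-once:
// conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))

// Checkpointing is required for stateful operations and driver recovery.
ssc.checkpoint("hdfs://namenode:8020/checkpoints/my-app")  // hypothetical path

val stream = ssc.receiverStream(new MyReceiver("somehost", 9999))

// The only lever I see for controlling distribution: explicitly
// repartition the RDDs the stream produces before the heavy processing.
val spread = stream.repartition(16)

spread.foreachRDD { rdd =>
  rdd.foreach(record => process(record))  // process() is hypothetical
}

ssc.start()
ssc.awaitTermination()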