Hi, I have a question regarding a deployment recommendation for Spark Structured Streaming.
I have roughly 100 Kafka topics that can all be processed with a similar code block. I am using PySpark 2.2.1. Here is the idea:

    TOPIC_LIST = ["topic_a", "topic_b", "topic_c"]  # ~100 topics in practice
    stream = {}
    for t in TOPIC_LIST:
        # one streaming query per topic, writing to the console sink every 10 seconds
        stream[t] = spark \
            .readStream \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "kafka-broker1:9092,kafka-broker2:9092") \
            .option("subscribe", t) \
            .load() \
            .writeStream \
            .trigger(processingTime="10 seconds") \
            .format("console") \
            .start()

So far, I see two options:

1. a single spark-submit for all topics
2. one spark-submit per topic

I am thinking of deploying this under supervisor or upstart so that I get notified when a Spark app goes down. With a single spark-submit I have more control over server resources, but it is harder to tell which stream is having a problem. With one spark-submit per topic I am afraid I would need roughly 100 CPU cores, which I doubt would be fully utilized since some of the topics have low throughput; on the other hand, with ~100 bash scripts maintained by supervisor I would get an alert for each of them. Two sketches of ideas I am weighing follow below.

For monitoring I am considering this approach: https://argus-sec.com/monitoring-spark-prometheus/
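One idea for keeping the core count down without losing all isolation is to group several low-throughput topics into a single query, since the Kafka source's "subscribe" option accepts a comma-separated list of topics (there is also "subscribePattern"). A rough sketch, where the GROUPS mapping is made up for illustration:

    # Sketch: one streaming query per *group* of topics instead of per topic.
    # GROUPS is a hypothetical split of the ~100 topics by expected throughput.
    GROUPS = {
        "low_throughput": ["topic_a", "topic_b"],
        "high_throughput": ["topic_c"],
    }

    stream = {}
    for name, topics in GROUPS.items():
        stream[name] = spark \
            .readStream \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "kafka-broker1:9092,kafka-broker2:9092") \
            .option("subscribe", ",".join(topics)) \
            .load() \
            .writeStream \
            .trigger(processingTime="10 seconds") \
            .format("console") \
            .start()

The topic name is still available in the "topic" column of the Kafka source, so rows from a grouped stream can be told apart downstream.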
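For the single-app option, I was picturing a watch loop like the one below to notice when an individual per-topic query dies (a sketch; alert() is a placeholder for whatever notification hook we end up using):

    # Sketch: watch all per-topic queries from inside one driver.
    # Assumes `stream` is the dict of StreamingQuery handles started above.
    import time

    def alert(msg):
        # placeholder notification hook, e.g. push to a pager or chat webhook
        print("ALERT: {0}".format(msg))

    while True:
        for topic, query in stream.items():
            if not query.isActive:
                # exception() returns the StreamingQueryException that
                # stopped the query, or None if it stopped cleanly
                alert("query for {0} stopped: {1}".format(topic, query.exception()))
        time.sleep(30)  # poll interval, arbitrary

Alternatively, spark.streams.awaitAnyTermination() blocks until any query dies, which might be enough if supervisor is allowed to restart the whole app.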
Any recommendations or advice?

Thanks.

Regards,
Abraham