Hi, I have a Spark project running on a single 4-core, 16 GB instance that acts as both master and worker. Can anyone tell me what I should be monitoring so that my cluster and jobs stay up?
I have put together a short list; please extend it if you know of more:

1. Monitor the Spark master and workers for failures
2. Monitor HDFS for filling up or going down
3. Monitor network connectivity between master and worker
4. Monitor Spark jobs for getting killed

For items 1 and 2, rough sketches of the kind of checks I have in mind follow below.
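For item 1, here is a minimal liveness sketch, assuming a standalone master whose web UI runs on the default port 8080 and also serves cluster state as JSON at /json; the hostname, exit codes, and script name are placeholders of mine, not anything standard:

```python
# check_spark_master.py -- minimal sketch: poll the standalone master's
# JSON endpoint and flag any worker that is not in the ALIVE state.
import json
import sys
import urllib.request

MASTER_URL = "http://spark-master:8080/json"  # placeholder hostname

def check_master():
    try:
        with urllib.request.urlopen(MASTER_URL, timeout=5) as resp:
            state = json.load(resp)
    except Exception as exc:
        # Master web UI unreachable: treat as a hard failure.
        print(f"CRITICAL: master unreachable: {exc}")
        sys.exit(2)

    # Each worker entry reports a "state" field; anything != ALIVE is suspect.
    dead = [w["id"] for w in state.get("workers", [])
            if w.get("state") != "ALIVE"]
    if dead:
        print(f"WARNING: workers not ALIVE: {dead}")
        sys.exit(1)
    print(f"OK: {len(state.get('workers', []))} worker(s) ALIVE")

if __name__ == "__main__":
    check_master()
```

Something like this could run from cron or be wrapped as a Nagios-style check; the non-zero exit codes are just an example convention for alerting.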
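For item 2, a rough capacity check, assuming the `hdfs` CLI is on the PATH and that the "DFS Used%" line of `hdfs dfsadmin -report` has its usual format; the 80% threshold is an arbitrary example value:

```python
# check_hdfs_usage.py -- rough sketch: parse overall DFS usage from
# `hdfs dfsadmin -report` and warn when it crosses a threshold.
import re
import subprocess
import sys

THRESHOLD_PCT = 80.0  # example alert threshold, tune to taste

def check_hdfs():
    # Run the report and capture its text output.
    out = subprocess.run(
        ["hdfs", "dfsadmin", "-report"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Pull the cluster-wide "DFS Used%" figure out of the report.
    m = re.search(r"DFS Used%:\s*([\d.]+)%", out)
    if not m:
        print("UNKNOWN: could not parse dfsadmin report")
        sys.exit(3)
    used = float(m.group(1))
    if used >= THRESHOLD_PCT:
        print(f"WARNING: HDFS {used:.1f}% full")
        sys.exit(1)
    print(f"OK: HDFS {used:.1f}% used")

if __name__ == "__main__":
    check_hdfs()
```

Thanks,
Best Regards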