Hi,
We have inherited a fairly long-winded Spark application with many stages. When we run it on our Spark cluster, things start off well enough: the workers are busy and lots of progress is made. However, 30 minutes into processing, CPU usage on the workers drops drastically. At the same time, the driver is maxing out exactly one core (though we've given it more than one), and its RAM usage is creeping up. While this is happening, no logs come out of the driver. Everything seems to stop, and then processing suddenly resumes and the workers get busy again. The driver's RAM usage doesn't go down, but flatlines. A few minutes later, the same thing happens again and the world seems to stop. Soon after, however, the driver crashes with an out-of-memory exception.
What could be causing this sort of behaviour on the driver? We don't have any collect() or similar functions in the code. We're reading from Azure blobs, processing, and writing back to Azure blobs. Where should we start in trying to get to the bottom of this? We're running Spark 2.4.1 in a standalone cluster.
Thanks,
Ashic.