Hi,

I'm trying to run Spark applications on a standalone cluster on AWS. Since my slaves are spot instances, they are sometimes terminated when the spot price rises above our bid. When applications are running at that moment, the Spark application sometimes dies, and the driver process hangs and stays up forever (as a zombie process), holding memory and CPU resources on the master machine. We then have to kill it manually with kill -9 to free these resources.
Has anyone seen this kind of problem before? Is there a suggested way to work around it?

Thanks,
Tomer