I am working on a problem that will eventually involve many millions of function calls. I have a small sample with several thousand calls working, but when I try to scale up the amount of data things stall. I use 120 partitions, and 116 of them finish in very little time. The remaining 4 seem to do all the work: they stall after a roughly fixed number of calls (about 1000) and make no further progress even after hours.
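To illustrate what I suspect is happening, here is a minimal pure-Python sketch (not Spark itself, and the key names and counts are made up) of how hash-partitioning a skewed key distribution concentrates nearly all of the records into a few partitions, the way Spark's default hash partitioner would:

```python
# Hypothetical sketch: a few "hot" keys carry most of the records,
# so hash-partitioning sends almost all the work to a few partitions.
from collections import Counter

NUM_PARTITIONS = 120

# Skewed workload: 4 hot keys hold 1,000,000 records in total,
# plus a long tail of 10,000 keys with one record each.
keys = ["hot_key_%d" % i for i in range(4)] * 250_000
keys += ["tail_key_%d" % i for i in range(10_000)]

# Assign each record to a partition by hashing its key.
partition_sizes = Counter(hash(k) % NUM_PARTITIONS for k in keys)

busiest = partition_sizes.most_common(4)
total = sum(partition_sizes.values())
hot_fraction = sum(n for _, n in busiest) / total
print("records in the 4 busiest partitions: %.0f%%" % (100 * hot_fraction))
```

Because the 4 hot keys can land in at most 4 distinct partitions, those partitions hold at least 1,000,000 of the 1,010,000 records (about 99%), regardless of how the hash seed falls, which matches the symptom of 116 partitions finishing instantly while 4 grind on.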
This is my first large and complex Spark job, and I would appreciate any insight into how to debug the issue, or better yet, why it might exist. The cluster has 15 machines and I am setting executor memory to 16G. What other questions are relevant to solving this issue?