I'm not entirely sure what the behaviour is when writing to a remote cluster.
It could be that a connection is being established for every element in your
dataframe; using foreachPartition instead might reduce the number of
connections. You may also have to look at what the executors do when they
unsubscribe.
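Something along these lines, maybe (just a sketch - RemoteClient, host and
port stand in for whatever client library your sink actually uses):

import org.apache.spark.sql.Row

df.foreachPartition { rows: Iterator[Row] =>
  // open one connection per partition instead of one per row
  val conn = RemoteClient.connect(host, port) // placeholder client
  try {
    rows.foreach(row => conn.send(row))
  } finally {
    conn.close() // release the connection even if a send fails
  }
}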
I managed to make the failing stage work by increasing memoryOverhead to
something ridiculous, > 50%. Our spark.executor.memory is 12g, and I bumped
spark.mesos.executor.memoryOverhead to 8g.
*Can someone explain why this solved the issue?* As I understand it,
memoryOverhead is meant for VM overheads.
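(For context on the numbers: if I read the Mesos scheduler right, the
container each executor runs in is sized at the sum of the two settings, so
here that is

  spark.executor.memory + spark.mesos.executor.memoryOverhead = 12g + 8g = 20g

per executor, of which only the 12g is JVM heap.)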
Not for the stage that fails; all it does is read and write. The number of
tasks that can run at once is (# of cores) * (# of executor instances), which
for us is 60 (3 cores, 20 executors).
As for the input partitioning of the failing stage: when Spark reads the 20
files, each 132 MB, it comes out to 40 partitions.
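(That 40 is probably spark.sql.files.maxPartitionBytes at work, assuming you
haven't changed it from its 128 MB default: 132 MB is just over the limit, so
each file is split into two input partitions, i.e.

  20 files * ceil(132 MB / 128 MB) = 20 * 2 = 40 partitions.)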