This usually happens when one of the workers is stuck in a GC pause and the connection times out. Try setting the following configurations on the SparkConf used to create the SparkContext:
val conf = new SparkConf()
  .set("spark.rdd.compress", "true")
  .set("spark.storage.memoryFraction", "1")
  .set("spark.core.connection.ack.wait.timeout", "6000")
val sc = new SparkContext(conf)
Hello all. I have been running a Spark job that eventually needs to do a large
join:
24 million rows x 150 million rows
A broadcast join is clearly infeasible at this scale, so I am instead
attempting to do it with hash partitioning, defining a custom partitioner as:
class
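The class definition is cut off above. For reference, a minimal custom partitioner in Scala might look like the following sketch; the name `CustomPartitioner` and the hash-modulo scheme are assumptions for illustration, not the poster's actual code:

```scala
import org.apache.spark.Partitioner

// Hypothetical sketch: assigns each key to a partition by hashing it
// modulo a fixed partition count (not the poster's actual class).
class CustomPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case null => 0
    case k    => math.abs(k.hashCode) % numPartitions
  }
}

// Usage: partition both RDDs with the same partitioner before joining,
// so matching keys are co-located and the join avoids a second shuffle:
// val left  = leftRdd.partitionBy(new CustomPartitioner(400))
// val right = rightRdd.partitionBy(new CustomPartitioner(400))
// val joined = left.join(right)
```

The key design point is that both sides of the join must use an *equal* partitioner instance (same class, same `numPartitions`), otherwise Spark will still shuffle one side.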