I have tried the configuration calculator sheet provided by Cloudera as
well, but there was no improvement. However, let us set aside the
17-million-record operation for now.
Consider instead a simple sort, which shows a tremendous difference
between YARN and standalone mode.
The operation is simple: a selected numeric column is sorted.
What does your Spark job do? Have you tried the standard configurations
and changed them gradually?
Have you checked the log files / web UI to see which tasks take long?
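The Spark web UI (port 4040 on the driver while the job runs, or the application link in the YARN ResourceManager UI) shows per-stage and per-task durations, which is usually the quickest way to spot stragglers. Enabling event logging keeps that information available after the job finishes. A sketch of the relevant `spark-defaults.conf` settings, with an illustrative log directory:

```
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs
```

With these set, the Spark History Server can replay finished applications so YARN and standalone runs can be compared task by task.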
17 million records does not sound like much, but it depends on what you do with them.
I do not think that for such a small "cluster" it makes sense to …
Performance issue: the time taken to complete a Spark job on YARN is
4x slower than in Spark standalone mode. However, in standalone mode,
jobs often fail with an "executor lost" error.
Hardware configuration
3 nodes (1 master and 2 workers), each with 32 GB RAM, 8 cores (16 with hyper-threading), and a 1 TB HDD
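Given that hardware, a common sizing heuristic (a sketch, not definitive tuning) is roughly 5 cores per executor, reserving about 1 core and 2 GB per node for the OS and Hadoop daemons. That gives 3 executors per worker, 6 across the 2 workers, minus one slot for the YARN ApplicationMaster; each executor then gets about 10 GB, of which roughly 10% goes to YARN memory overhead. The job name below is a placeholder:

```shell
# ~5 cores per executor; 3 executors/node; 1 executor slot left for the AM.
# ~10 GB per executor, split into heap (9g) + YARN overhead (1g).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 5 \
  --executor-cores 5 \
  --executor-memory 9g \
  --conf spark.executor.memoryOverhead=1g \
  your_job.py
```

If standalone mode loses executors with the same data, undersized executor memory (and resulting OOM kills) is a likely suspect there too, so it is worth applying comparable per-executor limits in both modes before comparing runtimes.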
Spark