I'm trying to understand the conceptual difference between these two
configurations in terms of performance (using a Spark standalone cluster):
Case 1:
1 Node
60 cores
240G of memory
50G of data on the local file system
Case 2:
6 Nodes
10 cores per node
40G of memory per node
50G of data on HDFS
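A minimal sketch of how the two layouts might be expressed as Spark standalone configurations. The host names, file paths, and the 5-core / 18g executor split are illustrative assumptions rather than anything stated above, and each case would be a separate application submission:

    import org.apache.spark.sql.SparkSession

    // Hypothetical session builder shared by both cases; only the master URL
    // and the input path differ between the two layouts.
    def buildSession(master: String): SparkSession =
      SparkSession.builder()
        .appName("layout-comparison")
        .master(master)
        .config("spark.executor.cores", "5")     // case 1: 12 executors on one worker; case 2: 2 executors x 6 nodes
        .config("spark.executor.memory", "18g")  // leaves headroom for the OS and Spark overhead in both layouts
        .config("spark.cores.max", "60")         // both layouts expose 60 cores in total
        .getOrCreate()

    // Case 1: single 60-core / 240G worker, data on the local file system.
    // val spark = buildSession("spark://single-node:7077")
    // val df = spark.read.textFile("file:///data/input")

    // Case 2: six 10-core / 40G workers, data on HDFS.
    val spark = buildSession("spark://master-node:7077")
    val df = spark.read.textFile("hdfs://namenode:8020/data/input")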
That depends! See inline. I am assuming that when you said replacing the
local disk with HDFS in case 1, you are connected to a separate HDFS
cluster (otherwise as in case 1) over a single 10G link. Also assuming that
all nodes (1 in case 1, and 6 in case 2) are worker nodes, and the
Spark application