Hi, a simple question about how Spark distributes a small dataset.
Let's say I have a cluster of 8 machines with 48 cores and 48 GB of RAM. The dataset (ORC format, written by Hive) is quite small, about 1 GB, and I have copied it to HDFS, so it is already distributed across the machines.

1) When spark-sql runs a query against this dataset on HDFS, what happens to the job? Does one machine handle the whole dataset because it is so small?

2) Or, since the dataset is already distributed, does each machine process its local portion and send the result to the Master node?

Could you explain in detail how this works in a distributed way?

Best, Phil
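To make my guess concrete: if I understand correctly, Spark tends to create roughly one input partition per HDFS block, so a quick back-of-envelope (assuming the default 128 MB block size, which may not match my cluster's config) would be:

```python
# Rough estimate of how many scan tasks Spark might launch,
# assuming one input partition per HDFS block.
# Both numbers below are assumptions, not measured values.
file_size_mb = 1024      # ~1 GB ORC dataset
hdfs_block_mb = 128      # default HDFS block size (assumption)

# Ceiling division: partial final block still becomes a partition.
num_partitions = -(-file_size_mb // hdfs_block_mb)
print(num_partitions)    # → 8
```

So with 8 machines, would those ~8 tasks get spread out, one per machine, or packed onto whichever executors happen to have the data local? Is that reasoning even right for ORC input?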