Hi,

A simple question about how Spark distributes a small dataset.

Let's say my cluster has 8 machines, each with 48 cores and 48 GB of RAM.
The dataset (ORC format, written by Hive) is quite small, about 1 GB, and I
have copied it to HDFS.

1) When spark-sql runs a query over this dataset on HDFS, what happens to
the job? Does a single machine end up handling the whole dataset because it
is so small?

2) Or, since the dataset is already distributed across the machines, does
each machine process its local portion and then send its result to the
master node?

Could you explain in detail how this works in a distributed setting?
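For context, here is my rough mental model, which I would like confirmed: Spark normally creates one input partition per HDFS block/split, so the partition count for my 1 GB file would follow from the block size. This is just a back-of-the-envelope sketch; the 128 MB block size is the usual HDFS default and an assumption on my part, not something I have verified on my cluster.

```python
import math

# Assumed values (please correct me if my cluster differs):
hdfs_block_size_mb = 128   # common HDFS default block size
dataset_size_mb = 1024     # my dataset is ~1 GB

# If Spark creates roughly one input partition per HDFS block,
# a 1 GB file would be split into about this many partitions:
num_partitions = math.ceil(dataset_size_mb / hdfs_block_size_mb)
print(num_partitions)  # -> 8
```

If that is right, the 8 partitions could each be scheduled on whichever node holds the corresponding block (data locality), rather than one machine pulling the whole file.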

Best,
Phil
