We are trying to add 6 spare servers to our existing cluster. Those machines have more CPU cores and more memory. Unfortunately, 3 of them can only take 2.5" hard drives, and the total capacity of each of those nodes is about 7TB. The other 3 nodes can only take 3.5" hard drives, but have 48TB each. For comparison, each node in the existing cluster has 16TB of hard drive capacity.
I know this is not ideal, but we could not afford buying enough 2.5" hard drives to match 48TB. I'm running Spark 2.0.1 on my CDH 5.3.2 cluster (will upgrade to Spark 2.1 soon) and usually start Spark in yarn-cluster mode. Here are my questions:

- I still want Spark executors running on all nodes in the cluster so that they remain useful during shuffle, but I want more data written to the nodes with larger hard drives when saving a data set to HDFS. Is that possible?
- Will data source output take the free space on each node into account? (See the sketch below for the kind of write I mean.)
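To make the second question concrete, the writes in question are ordinary data source writes to HDFS, roughly like the sketch below (the paths, format, and app name are placeholders, not our actual job):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the kind of job we run; paths and names are placeholders.
object SaveToHdfsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("save-dataset-example")
      .getOrCreate()

    // Placeholder input location on HDFS.
    val df = spark.read.parquet("hdfs:///data/input")

    // Each executor writes its partitions out to HDFS; the question is whether
    // the block placement for this output considers per-node free space on our
    // mixed 7TB / 48TB datanodes.
    df.write
      .mode("overwrite")
      .parquet("hdfs:///data/output")

    spark.stop()
  }
}
```

Thanks.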