We are trying to add 6 spare servers to our existing cluster. Those
machines have more CPU cores and more memory. Unfortunately, 3 of them can
only take 2.5" hard drives, which gives each of those nodes about 7TB of
total disk. The other 3 nodes can only take 3.5" hard drives, but have
48TB each. For comparison, each node in the existing cluster has 16TB of
disk.

I know this is not ideal, but we could not afford to buy enough 2.5" hard
drives to match the 48TB nodes.

I'm running Spark 2.0.1 on my CDH 5.3.2 cluster (will upgrade to Spark 2.1
soon), and I usually submit jobs in yarn-cluster mode.
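
For reference, roughly how I submit (the class name, jar, and resource
sizes below are placeholders, not my real job):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyJob \
      --num-executors 12 \
      --executor-memory 8g \
      --executor-cores 4 \
      my-job.jar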

Here are my questions:

   - I still want Spark executors running on all nodes in the cluster so
   that they remain useful during shuffles, but when saving a data set to
   HDFS I want more of the data written to the nodes with the larger hard
   drives. Is that possible? (See the sketch after this list for the kind
   of setting I have in mind.)
   - Will the data source output take the free space on each node into
   account?
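
For illustration, this is the kind of NameNode-side placement setting I am
hoping exists. The property names below are my assumption, based on the
AvailableSpaceBlockPlacementPolicy from HDFS-8131, which I believe is newer
than the Hadoop shipped with CDH 5.3.2, so treat this as a sketch of the
idea only:

    <!-- Sketch only: assumes AvailableSpaceBlockPlacementPolicy (HDFS-8131)
         is available, which may not be the case on CDH 5.3.2 -->
    <property>
      <name>dfs.block.replicator.classname</name>
      <value>org.apache.hadoop.hdfs.server.blockmanagement.AvailableSpaceBlockPlacementPolicy</value>
    </property>
    <property>
      <!-- How strongly to prefer the DataNode with more free space
           (values between 0.5 and 1.0) -->
      <name>dfs.namenode.available-space-block-placement-policy.balanced-space-preference-fraction</name>
      <value>0.6</value>
    </property>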

Thanks.
