We are trying to add 6 spare servers to our existing cluster. Those machines have more CPU cores and more memory. Unfortunately, 3 of them can only take 2.5" hard drives, and the total capacity of each of those nodes is about 7TB. The other 3 nodes can only take 3.5" hard drives, but have 48TB each. For comparison, each node in the existing cluster has 16TB of hard drive capacity.
I know this is not ideal, but we could not afford buying enough 2.5" hard drives to match 48TB. I'm running Spark 2.0.1 on my CDH 5.3.2 cluster (will upgrade to Spark 2.1 soon) and usually start Spark in yarn-cluster mode. Here are my questions:

- I still want Spark executors running on all nodes in the cluster so that they remain useful during shuffle, but I want more data written to the nodes with larger hard drives when saving a data set to HDFS. Is that possible?
- Will data source output take the free space on each node into account? (See the sketch below for the kind of write I mean.)
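To make the second question concrete, the writes in question are ordinary data source writes to HDFS, roughly like the sketch below (the paths, format, and app name are placeholders, not our actual job):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the kind of job we run; paths and names are placeholders.
object SaveToHdfsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("save-dataset-example")
      .getOrCreate()

    // Placeholder input location on HDFS.
    val df = spark.read.parquet("hdfs:///data/input")

    // Each executor writes its partitions out to HDFS; the question is whether
    // the block placement for this output considers per-node free space on our
    // mixed 7TB / 48TB datanodes.
    df.write
      .mode("overwrite")
      .parquet("hdfs:///data/output")

    spark.stop()
  }
}
```

Thanks.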