Hi, I find that loading files from HDFS can incur a huge amount of network traffic. The input size is 90 GB and the network traffic is about 80 GB. As I understand it, local files should be read, so no network communication should be needed.
I use Spark 1.5.1, and the following is my code:

    val textRDD = sc.textFile("hdfs://master:9000/inputDir")
    textRDD.count

Jeffrey