The input data consists of a number of files of 16 MB each.

On Mon, Oct 26, 2015 at 5:12 PM Jinfeng Li <liji...@gmail.com> wrote:
> I cat /proc/net/dev and then take the difference of received bytes before
> and after the job. I also see a long-time peak (nearly 600Mb/s) in nload
> interface. We have 18 machines and each machine receives 4.7G bytes.
>
> On Mon, Oct 26, 2015 at 5:00 PM Sean Owen <so...@cloudera.com> wrote:
>
>> -dev +user
>> How are you measuring network traffic?
>> It's not in general true that there will be zero network traffic, since
>> not all executors are local to all data. That can be the situation in many
>> cases but not always.
>>
>> On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li <liji...@gmail.com> wrote:
>>
>>> Hi, I find that loading files from HDFS can incur huge amount of network
>>> traffic. Input size is 90G and network traffic is about 80G. By my
>>> understanding, local files should be read and thus no network communication
>>> is needed.
>>>
>>> I use Spark 1.5.1, and the following is my code:
>>>
>>> val textRDD = sc.textFile("hdfs://master:9000/inputDir")
>>> textRDD.count
>>>
>>> Jeffrey
>>>
>>
>>
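
For reference, the measurement described in the thread (snapshot /proc/net/dev before and after the job, then subtract the received-byte counters) could be sketched roughly as below. This is only an illustration, not the code that was actually run: the object name NetDevDiff, its helper names, and the interface name passed by the caller are all assumptions. With 18 machines at 4.7G each, the per-machine deltas add up to roughly 84.6G, which is in line with the "about 80G" quoted.

```scala
import scala.io.Source

// Hypothetical helper, not part of Spark or Hadoop.
object NetDevDiff {
  // Parse the text of /proc/net/dev into (interface name -> received bytes).
  // The first two lines are headers; on every other line the first number
  // after "iface:" is the cumulative received-byte counter.
  def parseRxBytes(text: String): Map[String, Long] =
    text.split("\n").toList.drop(2).flatMap { line =>
      line.split(":", 2) match {
        case Array(iface, counters) if counters.trim.nonEmpty =>
          List(iface.trim -> counters.trim.split("\\s+")(0).toLong)
        case _ => Nil
      }
    }.toMap

  // Read the live counters of the machine this code runs on (Linux only).
  def snapshot(): Map[String, Long] = {
    val src = Source.fromFile("/proc/net/dev")
    try parseRxBytes(src.mkString) finally src.close()
  }

  // Bytes received on one interface between two snapshots.
  def rxDelta(before: Map[String, Long], after: Map[String, Long], iface: String): Long =
    after(iface) - before(iface)
}
```

One would call NetDevDiff.snapshot() on each machine before and after the job and pass both results to rxDelta with the interface name (for example "eth0", which is an assumption here). Note that the counter covers all traffic received on that interface during the interval, not only HDFS block reads.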