Hm, how about the opposite question -- do you have just 1 executor? Then again, everything will be remote except for a small fraction of blocks.
On Mon, Oct 26, 2015 at 9:28 AM, Jinfeng Li <liji...@gmail.com> wrote:

> The replication factor is 3 and we have 18 data nodes. We checked the HDFS
> web UI; data is evenly distributed among the 18 machines.
>
> On Mon, Oct 26, 2015 at 5:18 PM Sean Owen <so...@cloudera.com> wrote:
>
>> Have a look at your HDFS replication, and where the blocks are for these
>> files. For example, if you had only 2 HDFS data nodes, then data would be
>> remote to 16 of 18 workers and would always entail a copy.
>>
>> On Mon, Oct 26, 2015 at 9:12 AM, Jinfeng Li <liji...@gmail.com> wrote:
>>
>>> I cat /proc/net/dev and take the difference of received bytes before
>>> and after the job. I also see a long peak (nearly 600 Mb/s) in the
>>> nload interface. We have 18 machines and each machine receives 4.7 GB.
>>>
>>> On Mon, Oct 26, 2015 at 5:00 PM Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> -dev +user
>>>> How are you measuring network traffic?
>>>> It's not in general true that there will be zero network traffic, since
>>>> not all executors are local to all data. That can be the situation in
>>>> many cases, but not always.
>>>>
>>>> On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li <liji...@gmail.com> wrote:
>>>>
>>>>> Hi, I find that loading files from HDFS can incur a huge amount of
>>>>> network traffic. The input size is 90 GB and the network traffic is
>>>>> about 80 GB. By my understanding, local blocks should be read locally,
>>>>> so no network communication should be needed.
>>>>>
>>>>> I use Spark 1.5.1, and the following is my code:
>>>>>
>>>>> val textRDD = sc.textFile("hdfs://master:9000/inputDir")
>>>>> textRDD.count
>>>>>
>>>>> Jeffrey
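The measurement Jinfeng describes (diffing the received-bytes counter in /proc/net/dev around the job) can be sketched as below. This is a minimal illustration only; the helper name `rx_bytes`, the `eth0` interface, and the sample counter line are my own assumptions, not taken from the thread.

```python
# Sketch of the measurement from the thread: parse the per-interface
# RX-bytes counter out of /proc/net/dev, read it before and after the
# Spark job, and take the difference. Interface name "eth0" and the
# SAMPLE text are illustrative assumptions.

def rx_bytes(dev_text, iface="eth0"):
    """Return the received-bytes counter for `iface` from /proc/net/dev text."""
    for line in dev_text.splitlines():
        if ":" in line:
            name, counters = line.split(":", 1)
            if name.strip() == iface:
                return int(counters.split()[0])  # first column is RX bytes
    raise ValueError("interface %s not found" % iface)

# In practice, read the file twice, around the job:
#   before = rx_bytes(open("/proc/net/dev").read())
#   ... run textRDD.count ...
#   after = rx_bytes(open("/proc/net/dev").read())
#   print((after - before) / 1e9, "GB received")

SAMPLE = (
    "Inter-|   Receive                |  Transmit\n"
    " face |bytes    packets errs drop|bytes    packets errs drop\n"
    "  eth0: 4700000000  12345 0 0 9876543 1111 0 0\n"
)
print(rx_bytes(SAMPLE))  # → 4700000000
```

To follow Sean's suggestion of checking where the blocks actually live, the standard command is `hdfs fsck /inputDir -files -blocks -locations`, which lists each block's replica locations so you can confirm how many replicas are local to your executors.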