Hm, how about the opposite question -- do you have just 1 executor? Then again, everything will be remote except for a small fraction of blocks.
On Mon, Oct 26, 2015 at 9:28 AM, Jinfeng Li <liji...@gmail.com> wrote:

> The replication factor is 3 and we have 18 data nodes. We checked the HDFS
> web UI; data is evenly distributed among the 18 machines.
>
> On Mon, Oct 26, 2015 at 5:18 PM Sean Owen <so...@cloudera.com> wrote:
>
>> Have a look at your HDFS replication, and where the blocks are for these
>> files. For example, if you had only 2 HDFS data nodes, then data would be
>> remote to 16 of 18 workers and would always entail a copy.
>>
>> On Mon, Oct 26, 2015 at 9:12 AM, Jinfeng Li <liji...@gmail.com> wrote:
>>
>>> I cat /proc/net/dev and take the difference of received bytes before
>>> and after the job. I also see a long peak (nearly 600 Mb/s) in the
>>> nload interface. We have 18 machines and each machine receives 4.7 GB.
>>>
>>> On Mon, Oct 26, 2015 at 5:00 PM Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> -dev +user
>>>> How are you measuring network traffic?
>>>> It's not in general true that there will be zero network traffic, since
>>>> not all executors are local to all data. That can be the situation in
>>>> many cases, but not always.
>>>>
>>>> On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li <liji...@gmail.com> wrote:
>>>>
>>>>> Hi, I find that loading files from HDFS can incur a huge amount of
>>>>> network traffic. The input size is 90 GB and the network traffic is
>>>>> about 80 GB. By my understanding, local blocks should be read locally,
>>>>> so no network communication should be needed.
>>>>>
>>>>> I use Spark 1.5.1, and the following is my code:
>>>>>
>>>>> val textRDD = sc.textFile("hdfs://master:9000/inputDir")
>>>>> textRDD.count
>>>>>
>>>>> Jeffrey
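The measurement Jinfeng describes (diffing the received-bytes counter in /proc/net/dev around the job) can be sketched as below. This is a minimal illustration only; the helper name `rx_bytes`, the `eth0` interface, and the sample counter line are my own assumptions, not taken from the thread.

```python
# Sketch of the measurement from the thread: parse the per-interface
# RX-bytes counter out of /proc/net/dev, read it before and after the
# Spark job, and take the difference. Interface name "eth0" and the
# SAMPLE text are illustrative assumptions.

def rx_bytes(dev_text, iface="eth0"):
    """Return the received-bytes counter for `iface` from /proc/net/dev text."""
    for line in dev_text.splitlines():
        if ":" in line:
            name, counters = line.split(":", 1)
            if name.strip() == iface:
                return int(counters.split()[0])  # first column is RX bytes
    raise ValueError("interface %s not found" % iface)

# In practice, read the file twice, around the job:
#   before = rx_bytes(open("/proc/net/dev").read())
#   ... run textRDD.count ...
#   after = rx_bytes(open("/proc/net/dev").read())
#   print((after - before) / 1e9, "GB received")

SAMPLE = (
    "Inter-|   Receive                |  Transmit\n"
    " face |bytes    packets errs drop|bytes    packets errs drop\n"
    "  eth0: 4700000000  12345 0 0 9876543 1111 0 0\n"
)
print(rx_bytes(SAMPLE))  # → 4700000000
```

To follow Sean's suggestion of checking where the blocks actually live, the standard command is `hdfs fsck /inputDir -files -blocks -locations`, which lists each block's replica locations so you can confirm how many replicas are local to your executors.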