Re: Loading Files from HDFS Incurs Network Communication

Jinfeng Li Mon, 26 Oct 2015 02:16:16 -0700

The input data consists of a number of 16 MB files.

On Mon, Oct 26, 2015 at 5:12 PM Jinfeng Li <liji...@gmail.com> wrote:


> I cat /proc/net/dev and take the difference in received bytes before and
> after the job. I also see a sustained peak (nearly 600 Mb/s) in the nload
> interface. We have 18 machines, and each machine receives 4.7 GB.
>
> On Mon, Oct 26, 2015 at 5:00 PM Sean Owen <so...@cloudera.com> wrote:
>
>> -dev +user
>> How are you measuring network traffic?
>> It's not in general true that there will be zero network traffic, since
>> not all executors are local to all data. That can be the situation in many
>> cases but not always.
>>
>> On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li <liji...@gmail.com> wrote:
>>
>>> Hi, I find that loading files from HDFS can incur a huge amount of network
>>> traffic. The input size is 90 GB and the network traffic is about 80 GB. As I
>>> understand it, the files should be read locally, so no network communication
>>> should be needed.
>>>
>>> I use Spark 1.5.1, and the following is my code:
>>>
>>> val textRDD = sc.textFile("hdfs://master:9000/inputDir")
>>> textRDD.count
>>>
>>> Jeffrey
>>>
>>
>>
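Sean's point that not all executors are local to all data can be checked roughly from the driver. The following is only a sketch: the object `Locality` and the function `nonLocalFraction` are hypothetical names introduced here, not part of Spark. It takes the preferred hosts of each partition and the set of hosts running executors, and returns the fraction of partitions that have no executor on any preferred host.

```scala
object Locality {
  // preferred: for each partition, the hosts that hold its data locally.
  // executorHosts: the hosts on which executors are running.
  // Returns the fraction of partitions with no executor on a preferred host.
  def nonLocalFraction(preferred: Seq[Seq[String]],
                       executorHosts: Set[String]): Double =
    if (preferred.isEmpty) 0.0
    else {
      // A Set[String] can be used directly as a String => Boolean predicate.
      val nonLocal = preferred.count(hosts => !hosts.exists(executorHosts))
      nonLocal.toDouble / preferred.size
    }
}
```

In a Spark shell, the first argument could be built with `textRDD.partitions.toSeq.map(p => textRDD.preferredLocations(p))`. How the executor host set is obtained is left open here, and the comparison assumes both sides name hosts the same way (hostname versus IP address), which may not hold on every cluster.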

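The measurement described above (taking the difference of received bytes in /proc/net/dev before and after the job) could be scripted along these lines. This is a minimal sketch under assumptions: `NetRx` and its methods are hypothetical names, it relies on the usual Linux layout in which the first field after the interface name is the received-byte counter, and it only sees the machine it runs on, so it would have to be run on every worker.

```scala
import scala.io.Source

object NetRx {
  // Parse the text of /proc/net/dev into interface name -> received bytes.
  // The two header lines contain no ':' and are skipped.
  def parseRxBytes(procNetDev: String): Map[String, Long] =
    procNetDev.split("\n")
      .filter(_.contains(":"))
      .map { line =>
        val Array(name, stats) = line.split(":", 2)
        // First whitespace-separated field after the name: received bytes.
        name.trim -> stats.trim.split("\\s+")(0).toLong
      }
      .toMap

  // Read the live counters from the local machine (Linux only).
  def readRxBytes(path: String = "/proc/net/dev"): Map[String, Long] = {
    val src = Source.fromFile(path)
    try parseRxBytes(src.mkString) finally src.close()
  }

  // Bytes received per interface between two snapshots.
  def delta(before: Map[String, Long],
            after: Map[String, Long]): Map[String, Long] =
    after.map { case (iface, rx) => iface -> (rx - before.getOrElse(iface, 0L)) }
}
```

Taking one snapshot with `NetRx.readRxBytes()` before `textRDD.count` and another after it, then passing both to `NetRx.delta`, gives the per-interface byte counts for that machine. The counters include all traffic on the host, not only Spark's, and the loopback interface `lo` should be left out when comparing against the figures quoted in the thread.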