Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Hm, now I wonder if it's the same issue here: https://issues.apache.org/jira/browse/SPARK-10149 Does the setting described there help?
On Mon, Oct 26, 2015 at 11:39 AM, Jinfeng Li wrote:
> Hi, I have already tried the same code with Spark 1.3.1, there is no such
> problem.

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
-dev +user
How are you measuring network traffic? It's not in general true that there will be zero network traffic, since not all executors are local to all data. That can be the situation in many cases but not always.
On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li wrote:
> Hi,

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Hm, how about the opposite question -- do you have just 1 executor? Then again, everything will be remote except for a small fraction of blocks.
On Mon, Oct 26, 2015 at 9:28 AM, Jinfeng Li wrote:
> Replication factor is 3 and we have 18 data nodes. We check HDFS webUI,
> data
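To put a rough number on "a small fraction of blocks": the sketch below estimates how many blocks have at least one replica on the hosts where executors run. It assumes replicas land on distinct data nodes chosen uniformly at random, which is a simplification (HDFS placement is rack-aware), and the function name is made up for illustration.

```python
from math import comb

def fraction_with_local_replica(data_nodes, replication, executor_hosts):
    """Estimate the share of blocks with a replica on at least one executor host.

    Assumption (not from the thread): each block's replicas sit on distinct
    data nodes picked uniformly at random, and every executor host is also
    a data node.
    """
    if not 0 <= executor_hosts <= data_nodes:
        raise ValueError("executor_hosts must be between 0 and data_nodes")
    if not 1 <= replication <= data_nodes:
        raise ValueError("replication must be between 1 and data_nodes")
    # Probability that all replicas fall on nodes WITHOUT an executor.
    all_remote = comb(data_nodes - executor_hosts, replication) / comb(data_nodes, replication)
    return 1.0 - all_remote

# Replication factor 3 on 18 data nodes, executors on a single host.
single_host = fraction_with_local_replica(18, 3, 1)
# Same cluster, executors on every host.
all_hosts = fraction_with_local_replica(18, 3, 18)
```

Under these assumptions, a single executor host sees a local replica for 3/18 of the blocks, while executors on all 18 hosts can in principle read every block locally.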

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
I use standalone mode. Each machine has 4 workers. Spark is deployed correctly, as the web UI and the jps command both show. We are a team that has been using Spark for nearly half a year, starting from Spark 1.3.1. We found this problem in one of our applications, and I wrote a simple program to

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Yeah, are these stats actually reflecting data read locally, like through the loopback interface? I'm also no expert on the internals here but this may be measuring effectively local reads. Or are you sure it's not?
On Mon, Oct 26, 2015 at 11:14 AM, Steve Loughran wrote:

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Steve Loughran
> On 26 Oct 2015, at 09:28, Jinfeng Li wrote:
>
> Replication factor is 3 and we have 18 data nodes. We check HDFS webUI, data
> is evenly distributed among 18 machines.
>
Every block in HDFS (usually 64, 128, or 256 MB) is replicated across three machines, meaning 3

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
Replication factor is 3 and we have 18 data nodes. We checked the HDFS web UI; data is evenly distributed among the 18 machines.
On Mon, Oct 26, 2015 at 5:18 PM Sean Owen wrote:
> Have a look at your HDFS replication, and where the blocks are for these
> files. For example, if you had

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
Hi, I have already tried the same code with Spark 1.3.1, and there is no such problem. The configuration files are all copied directly from Spark 1.5.1. I feel it is a bug in Spark 1.5.1. Thanks a lot for your response.
On Mon, Oct 26, 2015 at 7:21 PM Sean Owen wrote:
> Yeah,

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
I cat /proc/net/dev and take the difference in received bytes before and after the job. I also see a sustained peak (nearly 600Mb/s) in the nload interface. We have 18 machines and each machine receives 4.7G bytes.
On Mon, Oct 26, 2015 at 5:00 PM Sean Owen wrote:
> -dev
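For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of the before/after comparison on /proc/net/dev. It is not the script used in this thread; the function names are hypothetical, and it assumes the usual Linux layout of two header lines followed by one `iface: counters` line per interface, with received bytes as the first counter. The loopback interface `lo` is skipped by default so that local reads are not counted as network traffic.

```python
def parse_rx_bytes(proc_net_dev_text):
    """Map interface name -> received bytes from /proc/net/dev contents."""
    rx = {}
    # The first two lines are column headers, not interface rows.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        fields = counters.split()
        if fields:
            # The first counter after the colon is received bytes.
            rx[name.strip()] = int(fields[0])
    return rx

def rx_delta(before_text, after_text, skip=("lo",)):
    """Received-byte difference per interface between two snapshots."""
    before = parse_rx_bytes(before_text)
    after = parse_rx_bytes(after_text)
    return {
        iface: after[iface] - before[iface]
        for iface in after
        if iface in before and iface not in skip
    }

def read_proc_net_dev(path="/proc/net/dev"):
    """Read one snapshot (Linux only); call before and after the job."""
    with open(path) as f:
        return f.read()
```

Take one snapshot with `read_proc_net_dev()` before submitting the job and one after it finishes, then pass both to `rx_delta`. Other traffic on the machine during the job is included in the result, so the number is an upper bound on what the job itself received.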

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Sean Owen
Have a look at your HDFS replication, and where the blocks are for these files. For example, if you had only 2 HDFS data nodes, then data would be remote to 16 of 18 workers and always entail a copy.
On Mon, Oct 26, 2015 at 9:12 AM, Jinfeng Li wrote:
> I cat /proc/net/dev and

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
The input data is a number of 16M files.
On Mon, Oct 26, 2015 at 5:12 PM Jinfeng Li wrote:
> I cat /proc/net/dev and then take the difference of received bytes before
> and after the job. I also see a long-time peak (nearly 600Mb/s) in nload
> interface. We have 18 machines

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Jinfeng Li
Hi, yes, it should be the same issue, but the solution doesn't apply in our situation. Anyway, thanks a lot for your replies.
On Mon, Oct 26, 2015 at 7:44 PM Sean Owen wrote:
> Hm, now I wonder if it's the same issue here:
> https://issues.apache.org/jira/browse/SPARK-10149

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Steve Loughran
On 26 Oct 2015, at 11:21, Sean Owen wrote:
> Yeah, are these stats actually reflecting data read locally, like through the loopback interface? I'm also no expert on the internals here but this may be measuring effectively local reads. Or are you