Re: writing to hdfs on master node much faster

2015-04-20 Thread Sean Owen
What machines are HDFS data nodes -- just your master? that would
explain it. Otherwise, is it actually the write that's slow or is
something else you're doing much faster on the master for other
reasons maybe? like you're actually shipping data via the master first
in some local computation? so the master's executor has the result
much faster?

On Mon, Apr 20, 2015 at 12:21 PM, jamborta jambo...@gmail.com wrote:
 Hi all,

 I have a three node cluster with identical hardware. I am trying a workflow
 where it reads data from hdfs, repartitions it and runs a few map operations
 then writes the results back to hdfs.

 It looks like that all the computation, including the repartitioning and the
 maps complete within similar time intervals on all the nodes, except when it
 writes it back to HDFS when the master node does the job way much faster
 then the slaves (15s for each block as opposed to 1.2 min for the slaves).

 Any suggestion what the reason might be?

 thanks,



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-hdfs-on-master-node-much-faster-tp22570.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: writing to hdfs on master node much faster

2015-04-20 Thread Tamas Jambor
Not sure what would slow it down as the repartition completes equally fast
on all nodes, implying that the data is available on all, then there are a
few computation steps none of them local on the master.

On Mon, Apr 20, 2015 at 12:57 PM, Sean Owen so...@cloudera.com wrote:

 What machines are HDFS data nodes -- just your master? that would
 explain it. Otherwise, is it actually the write that's slow or is
 something else you're doing much faster on the master for other
 reasons maybe? like you're actually shipping data via the master first
 in some local computation? so the master's executor has the result
 much faster?

 On Mon, Apr 20, 2015 at 12:21 PM, jamborta jambo...@gmail.com wrote:
  Hi all,
 
  I have a three node cluster with identical hardware. I am trying a
 workflow
  where it reads data from hdfs, repartitions it and runs a few map
 operations
  then writes the results back to hdfs.
 
  It looks like that all the computation, including the repartitioning and
 the
  maps complete within similar time intervals on all the nodes, except
 when it
  writes it back to HDFS when the master node does the job way much faster
  then the slaves (15s for each block as opposed to 1.2 min for the
 slaves).
 
  Any suggestion what the reason might be?
 
  thanks,
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-hdfs-on-master-node-much-faster-tp22570.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



RE: writing to hdfs on master node much faster

2015-04-20 Thread Evo Eftimov
Check whether your partitioning results in balanced partitions ie partitions 
with similar sizes - one of the reasons for the performance differences 
observed by you may be that after your explicit repartitioning, the partition 
on your master node is much smaller than the RDD partitions on the other 2 
nodes  

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Monday, April 20, 2015 12:57 PM
To: jamborta
Cc: user@spark.apache.org
Subject: Re: writing to hdfs on master node much faster

What machines are HDFS data nodes -- just your master? that would explain it. 
Otherwise, is it actually the write that's slow or is something else you're 
doing much faster on the master for other reasons maybe? like you're actually 
shipping data via the master first in some local computation? so the master's 
executor has the result much faster?

On Mon, Apr 20, 2015 at 12:21 PM, jamborta jambo...@gmail.com wrote:
 Hi all,

 I have a three node cluster with identical hardware. I am trying a 
 workflow where it reads data from hdfs, repartitions it and runs a few 
 map operations then writes the results back to hdfs.

 It looks like that all the computation, including the repartitioning 
 and the maps complete within similar time intervals on all the nodes, 
 except when it writes it back to HDFS when the master node does the 
 job way much faster then the slaves (15s for each block as opposed to 1.2 min 
 for the slaves).

 Any suggestion what the reason might be?

 thanks,



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-hdfs-on
 -master-node-much-faster-tp22570.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
 additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



writing to hdfs on master node much faster

2015-04-20 Thread jamborta
Hi all,

I have a three node cluster with identical hardware. I am trying a workflow
where it reads data from hdfs, repartitions it and runs a few map operations
then writes the results back to hdfs.

It looks like that all the computation, including the repartitioning and the
maps complete within similar time intervals on all the nodes, except when it
writes it back to HDFS when the master node does the job way much faster
then the slaves (15s for each block as opposed to 1.2 min for the slaves). 

Any suggestion what the reason might be?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-hdfs-on-master-node-much-faster-tp22570.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org