Re: writing to hdfs on master node much faster
What machines are HDFS data nodes -- just your master? That would explain it. Otherwise, is it actually the write that's slow, or is something else you're doing much faster on the master for other reasons -- like you're actually shipping data via the master first in some local computation, so the master's executor has the result sooner?

On Mon, Apr 20, 2015 at 12:21 PM, jamborta jambo...@gmail.com wrote:
> [snip]

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
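The first question above -- which machines are actually running DataNodes -- can be answered with `hdfs dfsadmin -report`, which lists the live DataNodes. Since that command needs a running cluster, the sketch below filters a made-up sample of its output (hostnames invented) to show what to look for; if only the master appears, every write from a slave has to cross the network:

```shell
# `hdfs dfsadmin -report` prints a "Live datanodes" section on a real
# cluster; here we filter a sample of that output the same way you
# would filter the real thing.
sample_report='Live datanodes (1):

Name: 10.0.0.1:50010 (master)
Hostname: master'

# On a cluster you would pipe the real command instead:
#   hdfs dfsadmin -report | awk '/^Hostname:/ {print $2}'
printf '%s\n' "$sample_report" | awk '/^Hostname:/ {print $2}'
```

If the list shows all three hosts, locality is not the issue and the write path itself deserves a closer look.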
Re: writing to hdfs on master node much faster
Not sure what would slow it down: the repartition completes equally fast on all nodes, implying that the data is available on all of them, and after that there are a few computation steps, none of them local to the master.

On Mon, Apr 20, 2015 at 12:57 PM, Sean Owen so...@cloudera.com wrote:
> [snip]

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
RE: writing to hdfs on master node much faster
Check whether your partitioning results in balanced partitions, i.e. partitions with similar sizes. One possible reason for the performance difference you observed is that, after your explicit repartitioning, the partition on your master node is much smaller than the RDD partitions on the other two nodes.

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Monday, April 20, 2015 12:57 PM
To: jamborta
Cc: user@spark.apache.org
Subject: Re: writing to hdfs on master node much faster

[snip]
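In Spark you can get the per-partition record counts directly with `rdd.glom().map(len).collect()`. The snippet below is a plain-Python stand-in (no Spark needed, key sets invented) showing how hash partitioning spreads uniform keys evenly but piles a hot key into a single partition:

```python
# Plain-Python illustration of partition balance under hash partitioning.
# In Spark itself, rdd.glom().map(len).collect() gives the same numbers.
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many records hash into each of num_partitions buckets."""
    sizes = Counter(hash(k) % num_partitions for k in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]

# Distinct integer keys spread evenly across 3 partitions...
balanced = partition_sizes(range(9000), 3)
# ...but 8000 copies of one hot key all land in the same partition.
skewed = partition_sizes(["hot"] * 8000 + [str(i) for i in range(1000)], 3)

print("balanced:", balanced)  # three equal counts
print("skewed:", skewed)      # one partition holds all the "hot" records
```

A partition several times larger than the others takes correspondingly longer to write, which would show up exactly as slow writers on some nodes.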
writing to hdfs on master node much faster
Hi all,

I have a three-node cluster with identical hardware. I am trying a workflow that reads data from HDFS, repartitions it, runs a few map operations, and then writes the results back to HDFS. It looks like all the computation, including the repartitioning and the maps, completes within similar time intervals on all the nodes, except for the write back to HDFS, where the master node does the job much faster than the slaves (15s for each block as opposed to 1.2 min for the slaves). Any suggestion what the reason might be?

thanks,

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-hdfs-on-master-node-much-faster-tp22570.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
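For reference, the workflow described above might look roughly like this in PySpark (paths, the partition count, and the map function are all invented for illustration; this needs a running Spark/HDFS cluster, so it is a sketch rather than a runnable test):

```python
# Hypothetical sketch of the read -> repartition -> map -> write pipeline.
from pyspark import SparkContext

sc = SparkContext(appName="repartition-write")
lines = sc.textFile("hdfs:///input/data")       # read from HDFS
parts = lines.repartition(6)                    # spread work across the 3 nodes
mapped = parts.map(lambda line: line.upper())   # placeholder for the map steps
mapped.saveAsTextFile("hdfs:///output/data")    # write back to HDFS
sc.stop()
```

The timing asymmetry reported above would appear in the final `saveAsTextFile` stage, which is where partition sizes and HDFS data-node placement both come into play.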