Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Sean Owen
-dev, +user

A decent guess: does your 'save' function entail collecting data back
to the driver? And are you running this from a machine that's not in
your Spark cluster? Then in client mode you're shipping data back to a
less-nearby machine than in cluster mode. That could explain the
bottleneck.
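
For illustration, here is a minimal sketch of the two shapes such a
'save' function could take (both functions are hypothetical, not from
the original job):

    import org.apache.spark.rdd.RDD

    // Driver-side save: collect() ships every record over the network
    // to the machine running the driver before anything is written.
    def saveViaDriver(rdd: RDD[String]): Unit = {
      val all = rdd.collect()
      all.foreach(println)
    }

    // Distributed save: each executor writes its own partitions
    // directly, so nothing funnels through the driver.
    def saveDistributed(rdd: RDD[String]): Unit = {
      rdd.saveAsTextFile("s3n://some.bucket/path")
    }

With the first shape, where the driver runs matters a lot; with the
second, it should matter much less.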

On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji eshi...@gmail.com wrote:
 Hi,

 I have a very, very simple streaming job. When I deploy it on the
 exact same cluster, with the exact same parameters, I see a big (40%)
 performance difference between client and cluster deployment mode.
 This seems a bit surprising. Is this expected?

 The streaming job is:

 val msgStream = kafkaStream
   .map { case (k, v) => v }
   .map(DatatypeConverter.printBase64Binary)
   .foreachRDD(save)
   .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])

 I tried several times, but the job deployed in client mode can only
 write at 60% of the throughput of the job deployed in cluster mode,
 and this happens consistently. I'm logging at INFO level, but my
 application code doesn't log anything, so it's only Spark logs. The
 logs I see in client mode don't seem like a crazy amount.

 The setup is:
 spark-ec2 [...] \
   --copy-aws-credentials \
   --instance-type=m3.2xlarge \
   -s 2 launch test_cluster

 And all the deployment was done from the master machine.





Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Also the job was deployed from the master machine in the cluster.

On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji eshi...@gmail.com wrote:

 Oh sorry, that was an edit mistake. The code is essentially:

  val msgStream = kafkaStream
    .map { case (k, v) => v }
    .map(DatatypeConverter.printBase64Binary)
    .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])

 I.e. there is essentially no original code (I was calling saveAsTextFile
 in a save function but that was just a remnant from previous debugging).



 On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen so...@cloudera.com wrote:

 -dev, +user

 A decent guess: does your 'save' function entail collecting data back
 to the driver? And are you running this from a machine that's not in
 your Spark cluster? Then in client mode you're shipping data back to a
 less-nearby machine than in cluster mode. That could explain the
 bottleneck.

 On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji eshi...@gmail.com wrote:
  Hi,
 
  I have a very, very simple streaming job. When I deploy it on the
  exact same cluster, with the exact same parameters, I see a big (40%)
  performance difference between client and cluster deployment mode.
  This seems a bit surprising. Is this expected?
 
  The streaming job is:
 
  val msgStream = kafkaStream
    .map { case (k, v) => v }
    .map(DatatypeConverter.printBase64Binary)
    .foreachRDD(save)
    .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
 
  I tried several times, but the job deployed in client mode can only
  write at 60% of the throughput of the job deployed in cluster mode,
  and this happens consistently. I'm logging at INFO level, but my
  application code doesn't log anything, so it's only Spark logs. The
  logs I see in client mode don't seem like a crazy amount.
 
  The setup is:
  spark-ec2 [...] \
    --copy-aws-credentials \
    --instance-type=m3.2xlarge \
    -s 2 launch test_cluster
 
  And all the deployment was done from the master machine.
 





Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Oh sorry, that was an edit mistake. The code is essentially:

 val msgStream = kafkaStream
   .map { case (k, v) => v }
   .map(DatatypeConverter.printBase64Binary)
   .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])

I.e. there is essentially no original code (I was calling saveAsTextFile in
a save function but that was just a remnant from previous debugging).
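
For completeness, here is a self-contained sketch of the whole job
under some assumptions: the Kafka parameters, topic, and batch interval
are placeholders, and since saveAsTextFile with a codec is an RDD
method (a DStream only has saveAsTextFiles), the sketch wraps the save
in foreachRDD with a per-batch output directory:

    import javax.xml.bind.DatatypeConverter
    import kafka.serializer.{DefaultDecoder, StringDecoder}
    import com.hadoop.compression.lzo.LzoCodec
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object Test {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Test")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Placeholder Kafka settings -- substitute real values.
        val kafkaStream = KafkaUtils
          .createStream[String, Array[Byte], StringDecoder, DefaultDecoder](
            ssc,
            Map("zookeeper.connect" -> "zk1:2181", "group.id" -> "test-group"),
            Map("some-topic" -> 1),
            StorageLevel.MEMORY_AND_DISK_SER_2)

        kafkaStream
          .map { case (k, v) => v }
          .map(DatatypeConverter.printBase64Binary)
          .foreachRDD { (rdd, time) =>
            // One directory per batch; saveAsTextFile fails if the
            // target path already exists.
            rdd.saveAsTextFile(
              s"s3n://some.bucket/path/${time.milliseconds}",
              classOf[LzoCodec])
          }

        ssc.start()
        ssc.awaitTermination()
      }
    }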



On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen so...@cloudera.com wrote:

 -dev, +user

 A decent guess: does your 'save' function entail collecting data back
 to the driver? And are you running this from a machine that's not in
 your Spark cluster? Then in client mode you're shipping data back to a
 less-nearby machine than in cluster mode. That could explain the
 bottleneck.

 On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji eshi...@gmail.com wrote:
  Hi,
 
  I have a very, very simple streaming job. When I deploy it on the
  exact same cluster, with the exact same parameters, I see a big (40%)
  performance difference between client and cluster deployment mode.
  This seems a bit surprising. Is this expected?
 
  The streaming job is:
 
  val msgStream = kafkaStream
    .map { case (k, v) => v }
    .map(DatatypeConverter.printBase64Binary)
    .foreachRDD(save)
    .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
 
  I tried several times, but the job deployed in client mode can only
  write at 60% of the throughput of the job deployed in cluster mode,
  and this happens consistently. I'm logging at INFO level, but my
  application code doesn't log anything, so it's only Spark logs. The
  logs I see in client mode don't seem like a crazy amount.
 
  The setup is:
  spark-ec2 [...] \
    --copy-aws-credentials \
    --instance-type=m3.2xlarge \
    -s 2 launch test_cluster
 
  And all the deployment was done from the master machine.
 



Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Tathagata Das
What are your spark-submit commands in both cases? Is it Spark
Standalone or YARN (both support client and cluster mode)? And
accordingly, what number of executors/cores is requested?

TD

On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji eshi...@gmail.com wrote:

 Also the job was deployed from the master machine in the cluster.

 On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji eshi...@gmail.com wrote:

 Oh sorry, that was an edit mistake. The code is essentially:

  val msgStream = kafkaStream
    .map { case (k, v) => v }
    .map(DatatypeConverter.printBase64Binary)
    .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])

 I.e. there is essentially no original code (I was calling saveAsTextFile
 in a save function but that was just a remnant from previous debugging).



 On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen so...@cloudera.com wrote:

 -dev, +user

 A decent guess: does your 'save' function entail collecting data back
 to the driver? And are you running this from a machine that's not in
 your Spark cluster? Then in client mode you're shipping data back to a
 less-nearby machine than in cluster mode. That could explain the
 bottleneck.

 On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji eshi...@gmail.com wrote:
  Hi,
 
  I have a very, very simple streaming job. When I deploy it on the
  exact same cluster, with the exact same parameters, I see a big (40%)
  performance difference between client and cluster deployment mode.
  This seems a bit surprising. Is this expected?
 
  The streaming job is:
 
  val msgStream = kafkaStream
    .map { case (k, v) => v }
    .map(DatatypeConverter.printBase64Binary)
    .foreachRDD(save)
    .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
 
  I tried several times, but the job deployed in client mode can only
  write at 60% of the throughput of the job deployed in cluster mode,
  and this happens consistently. I'm logging at INFO level, but my
  application code doesn't log anything, so it's only Spark logs. The
  logs I see in client mode don't seem like a crazy amount.
 
  The setup is:
  spark-ec2 [...] \
    --copy-aws-credentials \
    --instance-type=m3.2xlarge \
    -s 2 launch test_cluster
 
  And all the deployment was done from the master machine.
 






Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Hi Tathagata,

It's a standalone cluster. The submit commands are:

== CLIENT
spark-submit --class com.fake.Test \
  --deploy-mode client --master spark://fake.com:7077 \
  fake.jar arguments

== CLUSTER
spark-submit --class com.fake.Test \
  --deploy-mode cluster --master spark://fake.com:7077 \
  s3n://fake.jar arguments

And they are both occupying all available slots (8 cores * 2 machines = 16 slots).
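
For reference, pinning those resources explicitly would look something
like the following (--total-executor-cores and --executor-memory are
real spark-submit options for standalone mode, but the values here are
illustrative):

spark-submit --class com.fake.Test \
  --deploy-mode cluster --master spark://fake.com:7077 \
  --total-executor-cores 16 \
  --executor-memory 4g \
  s3n://fake.jar arguments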



On Thu, Jan 1, 2015 at 12:21 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 What are your spark-submit commands in both cases? Is it Spark
 Standalone or YARN (both support client and cluster mode)? And
 accordingly, what number of executors/cores is requested?

 TD

 On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji eshi...@gmail.com wrote:

 Also the job was deployed from the master machine in the cluster.

 On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji eshi...@gmail.com wrote:

 Oh sorry, that was an edit mistake. The code is essentially:

  val msgStream = kafkaStream
    .map { case (k, v) => v }
    .map(DatatypeConverter.printBase64Binary)
    .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])

 I.e. there is essentially no original code (I was calling saveAsTextFile
 in a save function but that was just a remnant from previous debugging).



 On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen so...@cloudera.com wrote:

 -dev, +user

 A decent guess: does your 'save' function entail collecting data back
 to the driver? And are you running this from a machine that's not in
 your Spark cluster? Then in client mode you're shipping data back to a
 less-nearby machine than in cluster mode. That could explain the
 bottleneck.

 On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji eshi...@gmail.com wrote:
  Hi,
 
  I have a very, very simple streaming job. When I deploy it on the
  exact same cluster, with the exact same parameters, I see a big (40%)
  performance difference between client and cluster deployment mode.
  This seems a bit surprising. Is this expected?
 
  The streaming job is:
 
  val msgStream = kafkaStream
    .map { case (k, v) => v }
    .map(DatatypeConverter.printBase64Binary)
    .foreachRDD(save)
    .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
 
  I tried several times, but the job deployed in client mode can only
  write at 60% of the throughput of the job deployed in cluster mode,
  and this happens consistently. I'm logging at INFO level, but my
  application code doesn't log anything, so it's only Spark logs. The
  logs I see in client mode don't seem like a crazy amount.
 
  The setup is:
  spark-ec2 [...] \
    --copy-aws-credentials \
    --instance-type=m3.2xlarge \
    -s 2 launch test_cluster
 
  And all the deployment was done from the master machine.
 