Re: Big performance difference between client and cluster deployment mode; is this expected?
-dev, +user

A decent guess: does your 'save' function entail collecting data back to the driver? And are you running this from a machine that's not in your Spark cluster? Then in client mode you're shipping data back to a less-nearby machine, compared with cluster mode. That could explain the bottleneck.

On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji eshi...@gmail.com wrote:

Hi,

I have a very, very simple streaming job. When I deploy this on the exact same cluster, with the exact same parameters, I see a big (40%) performance difference between client and cluster deployment mode. This seems a bit surprising. Is this expected?

The streaming job is:

    val msgStream = kafkaStream
      .map { case (k, v) => v }
      .map(DatatypeConverter.printBase64Binary)
      .foreachRDD(save)
      .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])

I tried several times, but the job deployed in client mode can only write at 60% of the throughput of the job deployed in cluster mode, and this happens consistently. I'm logging at INFO level, but my application code doesn't log anything, so it's only Spark logs. The logs I see in client mode don't seem like a crazy amount.

The setup is:

    spark-ec2 [...] \
      --copy-aws-credentials \
      --instance-type=m3.2xlarge \
      -s 2 launch test_cluster

And all the deployment was done from the master machine.
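[Editorial aside: the `.map(DatatypeConverter.printBase64Binary)` stage in the job above just base64-encodes each Kafka message's byte payload into a string. A minimal, self-contained Scala illustration of that per-record transformation, written against `java.util.Base64` (Java 8+) since `javax.xml.bind.DatatypeConverter` was removed from newer JDKs:]

```scala
import java.util.Base64

// What the printBase64Binary map stage does to each message payload,
// shown with java.util.Base64 instead of the removed javax.xml.bind API.
object Base64Step {
  def encode(payload: Array[Byte]): String =
    Base64.getEncoder.encodeToString(payload)

  def main(args: Array[String]): Unit = {
    // Encode a sample payload, as the streaming job would per message.
    println(encode("hello".getBytes("UTF-8"))) // prints "aGVsbG8="
  }
}
```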
Re: Big performance difference between client and cluster deployment mode; is this expected?
Also, the job was deployed from the master machine in the cluster.
Re: Big performance difference between client and cluster deployment mode; is this expected?
Oh sorry, that was an edit mistake. The code is essentially:

    val msgStream = kafkaStream
      .map { case (k, v) => v }
      .map(DatatypeConverter.printBase64Binary)
      .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])

I.e. there is essentially no original code. (I was calling saveAsTextFile in a 'save' function, but that was just a remnant from previous debugging.)
Re: Big performance difference between client and cluster deployment mode; is this expected?
What are your spark-submit commands in both cases? Is it Spark Standalone or YARN (both support client and cluster mode)? Accordingly, what is the number of executors/cores requested?

TD
Re: Big performance difference between client and cluster deployment mode; is this expected?
Hi Tathagata,

It's a standalone cluster. The submit commands are:

== CLIENT

    spark-submit --class com.fake.Test \
      --deploy-mode client --master spark://fake.com:7077 \
      fake.jar arguments

== CLUSTER

    spark-submit --class com.fake.Test \
      --deploy-mode cluster --master spark://fake.com:7077 \
      s3n://fake.jar arguments

And they are both occupying all available slots (8 cores * 2 machines = 16 slots).
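[Editorial aside: one way to make a client-vs-cluster comparison like-for-like is to pin the application's resources explicitly rather than letting each mode claim whatever slots are free. A sketch with hypothetical values, reusing the thread's fake.com/fake.jar names; `--total-executor-cores` and `--executor-memory` are the standard spark-submit flags for a standalone master:]

```shell
# Pin cores and memory so both deploy modes request identical resources.
# Values are illustrative, not a recommendation.
spark-submit --class com.fake.Test \
  --master spark://fake.com:7077 \
  --deploy-mode client \
  --total-executor-cores 16 \
  --executor-memory 4G \
  fake.jar arguments

# Repeat with --deploy-mode cluster for the comparison run.
```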