Oh sorry, that was an editing mistake. The code is essentially:

    val msgStream = kafkaStream
      .map { case (k, v) => v }
      .map(DatatypeConverter.printBase64Binary)
      .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
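For completeness, a self-contained sketch of what the full job would look like. This is an illustration, not the actual code from the thread: it assumes Spark Streaming 1.x's receiver-based Kafka API, hadoop-lzo on the classpath, and placeholder ZooKeeper host, group id, topic, and batch interval. Since the saveAsTextFile overload that takes a codec lives on RDD rather than DStream, the save is expressed inside foreachRDD:

    import javax.xml.bind.DatatypeConverter
    import kafka.serializer.{DefaultDecoder, StringDecoder}
    import com.hadoop.compression.lzo.LzoCodec  // from hadoop-lzo (assumed on classpath)
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("Base64ToS3")
    val ssc  = new StreamingContext(conf, Seconds(10))  // batch interval is a guess

    // Receiver-based Kafka stream with byte-array values, since
    // printBase64Binary expects Array[Byte]. Host, group, and topic
    // names are placeholders.
    val kafkaStream = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](
      ssc,
      Map("zookeeper.connect" -> "zk-host:2181", "group.id" -> "test-group"),
      Map("some-topic" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER)

    kafkaStream
      .map { case (_, v) => v }
      .map(DatatypeConverter.printBase64Binary)
      .foreachRDD { rdd =>
        // The codec-taking saveAsTextFile is defined on RDD, not DStream,
        // so each batch is written out here, directly from the executors.
        rdd.saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
      }

    ssc.start()
    ssc.awaitTermination()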
I.e. there is essentially no original code (I was calling saveAsTextFile in a "save" function, but that was just a remnant from previous debugging).

On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen <so...@cloudera.com> wrote:
> -dev, +user
>
> A decent guess: Does your 'save' function entail collecting data back
> to the driver? And are you running this from a machine that's not in
> your Spark cluster? Then in client mode you're shipping data back to a
> less-nearby machine, compared with cluster mode. That could explain
> the bottleneck.
>
> On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji <eshi...@gmail.com> wrote:
> > Hi,
> >
> > I have a very, very simple streaming job. When I deploy this on the
> > exact same cluster, with the exact same parameters, I see a big (40%)
> > performance difference between "client" and "cluster" deployment mode.
> > This seems a bit surprising... Is this expected?
> >
> > The streaming job is:
> >
> > val msgStream = kafkaStream
> >   .map { case (k, v) => v }
> >   .map(DatatypeConverter.printBase64Binary)
> >   .foreachRDD(save)
> >   .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
> >
> > I tried several times, but the job deployed in "client" mode can only
> > write at 60% of the throughput of the job deployed in "cluster" mode,
> > and this happens consistently. I'm logging at INFO level, but my
> > application code doesn't log anything, so it's only Spark logs. The
> > logs I see in "client" mode don't seem like a crazy amount.
> >
> > The setup is:
> > spark-ec2 [...] \
> >   --copy-aws-credentials \
> >   --instance-type=m3.2xlarge \
> >   -s 2 launch test_cluster
> >
> > And all the deployment was done from the master machine.
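To make the difference Sean is guessing at concrete: a "save" function that collects to the driver pulls every batch across the network to wherever spark-submit was run, while a distributed write never routes the data through the driver. Both functions below are hypothetical illustrations, not code from this thread:

    import org.apache.spark.rdd.RDD

    // Hypothetical driver-side save: collect() ships every record back to
    // the driver JVM. In client mode the driver runs on the submitting
    // machine, so all output data crosses the network to it.
    def saveViaDriver(rdd: RDD[String]): Unit = {
      val all = rdd.collect()  // entire batch materialized on the driver
      java.nio.file.Files.write(
        java.nio.file.Paths.get("/tmp/batch.txt"),
        all.mkString("\n").getBytes("UTF-8"))
    }

    // Distributed save: each executor writes its own partitions straight
    // to S3, so the driver's location is irrelevant to the data path.
    def saveDistributed(rdd: RDD[String]): Unit =
      rdd.saveAsTextFile("s3n://some.bucket/path")

Since the job above writes with saveAsTextFile from the executors (the second pattern), the collect-to-driver explanation would not apply to it.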