Also the job was deployed from the master machine in the cluster. ᐧ On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji <eshi...@gmail.com> wrote:
> Oh sorry that was a edit mistake. The code is essentially: > > val msgStream = kafkaStream > .map { case (k, v) => v} > .map(DatatypeConverter.printBase64Binary) > .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec]) > > I.e. there is essentially no original code (I was calling saveAsTextFile > in a "save" function but that was just a remnant from previous debugging). > > > ᐧ > > On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen <so...@cloudera.com> wrote: > >> -dev, +user >> >> A decent guess: Does your 'save' function entail collecting data back >> to the driver? and are you running this from a machine that's not in >> your Spark cluster? Then in client mode you're shipping data back to a >> less-nearby machine, compared to with cluster mode. That could explain >> the bottleneck. >> >> On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji <eshi...@gmail.com> wrote: >> > Hi, >> > >> > I have a very, very simple streaming job. When I deploy this on the >> exact >> > same cluster, with the exact same parameters, I see big (40%) >> performance >> > difference between "client" and "cluster" deployment mode. This seems a >> bit >> > surprising.. Is this expected? >> > >> > The streaming job is: >> > >> > val msgStream = kafkaStream >> > .map { case (k, v) => v} >> > .map(DatatypeConverter.printBase64Binary) >> > .foreachRDD(save) >> > .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec]) >> > >> > I tried several times, but the job deployed with "client" mode can only >> > write at 60% throughput of the job deployed with "cluster" mode and this >> > happens consistently. I'm logging at INFO level, but my application code >> > doesn't log anything so it's only Spark logs. The logs I see in "client" >> > mode doesn't seem like a crazy amount. >> > >> > The setup is: >> > spark-ec2 [...] \ >> > --copy-aws-credentials \ >> > --instance-type=m3.2xlarge \ >> > -s 2 launch test_cluster >> > >> > And all the deployment was done from the master machine. >> > >> > ᐧ >> > >