Also the job was deployed from the master machine in the cluster.
ᐧ

On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji <eshi...@gmail.com> wrote:

> Oh sorry that was a edit mistake. The code is essentially:
>
>      val msgStream = kafkaStream
>        .map { case (k, v) => v}
>        .map(DatatypeConverter.printBase64Binary)
>        .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>
> I.e. there is essentially no original code (I was calling saveAsTextFile
> in a "save" function but that was just a remnant from previous debugging).
>
>
> ᐧ
>
> On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> -dev, +user
>>
>> A decent guess: Does your 'save' function entail collecting data back
>> to the driver? and are you running this from a machine that's not in
>> your Spark cluster? Then in client mode you're shipping data back to a
>> less-nearby machine, compared to with cluster mode. That could explain
>> the bottleneck.
>>
>> On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji <eshi...@gmail.com> wrote:
>> > Hi,
>> >
>> > I have a very, very simple streaming job. When I deploy this on the
>> exact
>> > same cluster, with the exact same parameters, I see big (40%)
>> performance
>> > difference between "client" and "cluster" deployment mode. This seems a
>> bit
>> > surprising.. Is this expected?
>> >
>> > The streaming job is:
>> >
>> >     val msgStream = kafkaStream
>> >       .map { case (k, v) => v}
>> >       .map(DatatypeConverter.printBase64Binary)
>> >       .foreachRDD(save)
>> >       .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>> >
>> > I tried several times, but the job deployed with "client" mode can only
>> > write at 60% throughput of the job deployed with "cluster" mode and this
>> > happens consistently. I'm logging at INFO level, but my application code
>> > doesn't log anything so it's only Spark logs. The logs I see in "client"
>> > mode doesn't seem like a crazy amount.
>> >
>> > The setup is:
>> > spark-ec2 [...] \
>> >   --copy-aws-credentials \
>> >   --instance-type=m3.2xlarge \
>> >   -s 2 launch test_cluster
>> >
>> > And all the deployment was done from the master machine.
>> >
>> > ᐧ
>>
>
>

Reply via email to