Hey Chris,

Apologies for the delayed reply. Your responses are always insightful and appreciated :-)
However, I have a few more questions.

Regarding "also, it looks like you're writing to S3 per RDD. you'll want to broaden that out to write DStream batches": I assume you mean dstream.saveAsTextFiles(...) vs rdd.saveAsTextFile(...). However, looking at the source code in DStream.scala, saveAsTextFiles simply wraps rdd.saveAsTextFile:

    def saveAsTextFiles(prefix: String, suffix: String = "") {
      val saveFunc = (rdd: RDD[T], time: Time) => {
        val file = rddToFileName(prefix, suffix, time)
        rdd.saveAsTextFile(file)
      }
      this.foreachRDD(saveFunc)
    }

So it's not clear to me how it would improve the throughput.

Also, regarding your comment "expand even further and write window batches (where the window interval is a multiple of the batch interval)": I still don't quite understand the mechanics underneath. For example, what would be the difference between extending the batch interval and adding windowed batches? I presume it has something to do with the processor thread(s) within a receiver?

By the way, in the near future I'll be putting together some performance numbers on a proper deployment, and I will be sure to share my findings.

Thanks, Mike!

On Sat, Mar 21, 2015 at 8:09 AM, Chris Fregly <ch...@fregly.com> wrote:

> hey mike!
>
> you'll definitely want to increase your parallelism by adding more shards
> to the stream - as well as spinning up 1 receiver per shard and unioning
> all the shards per the KinesisWordCount example that is included with the
> kinesis streaming package.
>
> you'll need more cores (cluster) or threads (local) to support this -
> equalling at least the number of shards/receivers + 1.
>
> also, it looks like you're writing to S3 per RDD. you'll want to broaden
> that out to write DStream batches - or expand even further and write
> window batches (where the window interval is a multiple of the batch
> interval).
>
> this goes for any spark streaming implementation - not just Kinesis.
>
> lemme know if that works for you.
>
> thanks!
>
> -Chris
>
> _____________________________
> From: Mike Trienis <mike.trie...@orcsol.com>
> Sent: Wednesday, March 18, 2015 2:45 PM
> Subject: Spark Streaming S3 Performance Implications
> To: <user@spark.apache.org>
>
> Hi All,
>
> I am pushing data from a Kinesis stream to S3 using Spark Streaming and
> noticed during testing (i.e. master=local[2]) that the batches (1 second
> intervals) were falling behind the incoming data stream at about 5-10
> events / second. It seems that the rdd.saveAsTextFile("s3n://...") call is
> taking a few seconds to complete.
>
>     val saveFunc = (rdd: RDD[String], time: Time) => {
>       val count = rdd.count()
>       if (count > 0) {
>         val s3BucketInterval = time.milliseconds.toString
>         rdd.saveAsTextFile("s3n://...")
>       }
>     }
>
>     dataStream.foreachRDD(saveFunc)
>
> Should I expect the same behaviour in a deployed cluster? Or does
> rdd.saveAsTextFile("s3n://...") distribute the push work to each worker node?
>
> "Write the elements of the dataset as a text file (or set of text files)
> in a given directory in the local filesystem, HDFS or any other
> Hadoop-supported file system. Spark will call toString on each element to
> convert it to a line of text in the file."
>
> Thanks, Mike.
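P.S. To make sure I'm following the windowed-write idea correctly, here is a rough sketch of what I'm picturing. The 10-second window, the socket source (standing in for my unioned Kinesis receivers) and the bucket name are just placeholders, so please correct me if this isn't what you meant:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedS3Write {
      def main(args: Array[String]): Unit = {
        // 1-second batch interval, as in my current test setup.
        val conf = new SparkConf().setAppName("WindowedS3Write")
        val ssc  = new StreamingContext(conf, Seconds(1))

        // Placeholder source; in my case this would be the unioned Kinesis DStreams.
        val dataStream = ssc.socketTextStream("localhost", 9999)

        // The window interval (10s) is a multiple of the batch interval (1s),
        // so each write to S3 covers ten batches' worth of records instead of one.
        dataStream
          .window(Seconds(10))
          .saveAsTextFiles("s3n://my-bucket/events")   // one output directory per window

        ssc.start()
        ssc.awaitTermination()
      }
    }

If I've read that right, the upside over simply extending the batch interval would be fewer, larger writes to S3 without changing how often the receivers hand data to Spark.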