Hey Chris,

Apologies for the delayed reply. Your responses are always insightful and appreciated :-)
However, I have a few more questions.

Regarding "also, it looks like you're writing to S3 per RDD. you'll want to broaden that out to write DStream batches": I assume you mean dstream.saveAsTextFiles(...) vs rdd.saveAsTextFile(...). However, looking at the source code in DStream.scala, saveAsTextFiles simply wraps rdd.saveAsTextFile:

    def saveAsTextFiles(prefix: String, suffix: String = "") {
      val saveFunc = (rdd: RDD[T], time: Time) => {
        val file = rddToFileName(prefix, suffix, time)
        rdd.saveAsTextFile(file)
      }
      this.foreachRDD(saveFunc)
    }

So it's not clear to me how it would improve the throughput.

Also, regarding your comment "expand even further and write window batches (where the window interval is a multiple of the batch interval)": I still don't quite understand the mechanics underneath. For example, what would be the difference between extending the batch interval and adding windowed batches? I presume it has something to do with the processor thread(s) within a receiver?

By the way, in the near future I'll be putting together some performance numbers on a proper deployment, and I will be sure to share my findings.

Thanks, Mike!

On Sat, Mar 21, 2015 at 8:09 AM, Chris Fregly <ch...@fregly.com> wrote:

> hey mike!
>
> you'll definitely want to increase your parallelism by adding more shards
> to the stream - as well as spinning up 1 receiver per shard and unioning
> all the shards per the KinesisWordCount example that is included with the
> kinesis streaming package.
>
> you'll need more cores (cluster) or threads (local) to support this -
> equalling at least the number of shards/receivers + 1.
>
> also, it looks like you're writing to S3 per RDD. you'll want to broaden
> that out to write DStream batches - or expand even further and write
> window batches (where the window interval is a multiple of the batch
> interval).
>
> this goes for any spark streaming implementation - not just Kinesis.
>
> lemme know if that works for you.
>
> thanks!
>
> -Chris
>
> _____________________________
> From: Mike Trienis <mike.trie...@orcsol.com>
> Sent: Wednesday, March 18, 2015 2:45 PM
> Subject: Spark Streaming S3 Performance Implications
> To: <user@spark.apache.org>
>
> Hi All,
>
> I am pushing data from a Kinesis stream to S3 using Spark Streaming and
> noticed during testing (i.e. master=local[2]) that the batches (1 second
> intervals) were falling behind the incoming data stream at about 5-10
> events / second. It seems that the rdd.saveAsTextFile("s3n://...") call is
> taking a few seconds to complete.
>
>     val saveFunc = (rdd: RDD[String], time: Time) => {
>       val count = rdd.count()
>       if (count > 0) {
>         val s3BucketInterval = time.milliseconds.toString
>         rdd.saveAsTextFile("s3n://...")
>       }
>     }
>
>     dataStream.foreachRDD(saveFunc)
>
> Should I expect the same behaviour in a deployed cluster? Or does
> rdd.saveAsTextFile("s3n://...") distribute the push work to each worker node?
>
> "Write the elements of the dataset as a text file (or set of text files)
> in a given directory in the local filesystem, HDFS or any other
> Hadoop-supported file system. Spark will call toString on each element to
> convert it to a line of text in the file."
>
> Thanks, Mike.
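P.S. To make sure I'm following the windowed-write idea correctly, here is a rough sketch of what I'm picturing. The 10-second window, the socket source (standing in for my unioned Kinesis receivers) and the bucket name are just placeholders, so please correct me if this isn't what you meant:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedS3Write {
      def main(args: Array[String]): Unit = {
        // 1-second batch interval, as in my current test setup.
        val conf = new SparkConf().setAppName("WindowedS3Write")
        val ssc  = new StreamingContext(conf, Seconds(1))

        // Placeholder source; in my case this would be the unioned Kinesis DStreams.
        val dataStream = ssc.socketTextStream("localhost", 9999)

        // The window interval (10s) is a multiple of the batch interval (1s),
        // so each write to S3 covers ten batches' worth of records instead of one.
        dataStream
          .window(Seconds(10))
          .saveAsTextFiles("s3n://my-bucket/events")   // one output directory per window

        ssc.start()
        ssc.awaitTermination()
      }
    }

If I've read that right, the upside over simply extending the batch interval would be fewer, larger writes to S3 without changing how often the receivers hand data to Spark.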