Re: Spark Streaming S3 Performance Implications
Hey Chris,

Apologies for the delayed reply. Your responses are always insightful and appreciated :-) However, I have a few more questions.

"also, it looks like you're writing to S3 per RDD. you'll want to broaden that out to write DStream batches"

I assume you mean "dstream.saveAsTextFiles()" vs. "rdd.saveAsTextFile()". However, looking at the source code in DStream.scala, saveAsTextFiles simply wraps rdd.saveAsTextFile:

    def saveAsTextFiles(prefix: String, suffix: String = "") {
      val saveFunc = (rdd: RDD[T], time: Time) => {
        val file = rddToFileName(prefix, suffix, time)
        rdd.saveAsTextFile(file)
      }
      this.foreachRDD(saveFunc)
    }

So it's not clear to me how it would improve throughput.

Also, regarding your comment "expand even further and write window batches (where the window interval is a multiple of the batch interval)": I still don't quite understand the mechanics underneath. For example, what would be the difference between extending the batch interval and adding windowed batches? I presume it has something to do with the processor thread(s) within a receiver?

By the way, in the near future I'll be putting together some performance numbers on a proper deployment, and will be sure to share my findings.

Thanks, Mike!
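As a concrete illustration of the window-batch idea (not code from the thread, just a minimal sketch): with a 1-second batch interval, a 10-second window whose slide equals the window length hands foreachRDD one RDD covering roughly ten batches, so S3 sees one larger write instead of ten small ones. The stream source, bucket path, and intervals below are placeholders.

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

    val conf = new SparkConf().setAppName("WindowedS3Writer")
    val ssc  = new StreamingContext(conf, Seconds(1))   // 1-second batch interval

    // stand-in for the real source (e.g. the unioned Kinesis shard streams)
    val dataStream = ssc.socketTextStream("localhost", 9999)

    // window = 10s, slide = 10s: each windowed RDD bundles ~10 batches,
    // so S3 gets one larger write instead of ten small ones
    dataStream.window(Seconds(10), Seconds(10)).foreachRDD { (rdd: RDD[String], time: Time) =>
      if (rdd.count() > 0) {   // skip empty windows
        rdd.saveAsTextFile(s"s3n://my-bucket/events/${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()

This is distinct from simply lengthening the batch interval: the batch interval still governs how often the receiver's blocks are turned into RDDs, while the window controls how much accumulated data each write sees.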
Re: Spark Streaming S3 Performance Implications
Mike:

Once Hadoop 2.7.0 is released, you should be able to enjoy the enhanced performance of s3a. See HADOOP-11571.

Cheers
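For context, s3a is intended as a drop-in replacement for the s3n URI scheme; once a cluster is on Hadoop 2.7.0+ with the hadoop-aws jar on the classpath, switching is mostly configuration. A minimal sketch (bucket name and credential sourcing are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("S3aSmokeTest"))

    // wire up the s3a filesystem (Hadoop 2.7.0+, hadoop-aws on the classpath)
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // same saveAsTextFile call as before -- only the URI scheme changes
    sc.parallelize(Seq("hello", "world")).saveAsTextFile("s3a://my-bucket/test-output")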
Re: Spark Streaming S3 Performance Implications
hey mike!

you'll definitely want to increase your parallelism by adding more shards to the stream - as well as spinning up 1 receiver per shard and unioning all the shards per the KinesisWordCount example that is included with the kinesis streaming package.

you'll need more cores (cluster) or threads (local) to support this - equalling at least the number of shards/receivers + 1.

also, it looks like you're writing to S3 per RDD. you'll want to broaden that out to write DStream batches - or expand even further and write window batches (where the window interval is a multiple of the batch interval).

this goes for any spark streaming implementation - not just Kinesis.

lemme know if that works for you. a sketch of the receiver-per-shard pattern follows at the end of this message.

thanks!

-Chris

_
From: Mike Trienis
Sent: Wednesday, March 18, 2015 2:45 PM
Subject: Spark Streaming S3 Performance Implications
To:

Hi All,

I am pushing data from a Kinesis stream to S3 using Spark Streaming and noticed during testing (i.e. master=local[2]) that the batches (1-second intervals) were falling behind the incoming data stream at about 5-10 events/second. It seems that rdd.saveAsTextFile(s3n://...) is taking a few seconds to complete.

    val saveFunc = (rdd: RDD[String], time: Time) => {
      val count = rdd.count()
      if (count > 0) {
        val s3BucketInterval = time.milliseconds.toString
        rdd.saveAsTextFile(s3n://...)
      }
    }

    dataStream.foreachRDD(saveFunc)

Should I expect the same behaviour in a deployed cluster? Or does rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?

"Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file."

Thanks, Mike.
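A minimal sketch of the "one receiver per shard, then union" pattern Chris describes, modeled on the bundled KinesisWordCountASL example (Spark 1.x Kinesis ASL API; stream name, endpoint, and shard count are placeholders for your own setup):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kinesis.KinesisUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val numShards     = 4              // match the shard count of your stream
    val batchInterval = Seconds(1)

    // needs at least numShards + 1 cores/threads: one per receiver, plus one to process
    val ssc = new StreamingContext(new SparkConf().setAppName("KinesisToS3"), batchInterval)

    // one receiver per shard ...
    val shardStreams = (0 until numShards).map { _ =>
      KinesisUtils.createStream(ssc, "myKinesisStream",
        "https://kinesis.us-east-1.amazonaws.com", batchInterval,
        InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
    }

    // ... then union them into a single DStream for downstream processing
    val records = ssc.union(shardStreams).map(bytes => new String(bytes, "UTF-8"))

    records.count().print()   // replace with the windowed S3 write
    ssc.start()
    ssc.awaitTermination()

Note that this is why master=local[2] falls behind: with one thread consumed by the receiver, only one thread remains to both process batches and block on the S3 writes.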