Mike: Once Hadoop 2.7.0 is released, you should be able to enjoy the enhanced performance of s3a. See HADOOP-11571.
Cheers

On Sat, Mar 21, 2015 at 8:09 AM, Chris Fregly <ch...@fregly.com> wrote:
> hey mike!
>
> you'll definitely want to increase your parallelism by adding more shards
> to the stream - as well as spinning up one receiver per shard and unioning
> all the shards, per the KinesisWordCount example that is included with the
> Kinesis streaming package.
>
> you'll need more cores (cluster) or threads (local) to support this -
> at least the number of shards/receivers + 1.
>
> also, it looks like you're writing to S3 per RDD. you'll want to broaden
> that out to write DStream batches - or expand even further and write
> window batches (where the window interval is a multiple of the batch
> interval).
>
> this goes for any Spark Streaming implementation - not just Kinesis.
>
> lemme know if that works for you.
>
> thanks!
>
> -Chris
>
> _____________________________
> From: Mike Trienis <mike.trie...@orcsol.com>
> Sent: Wednesday, March 18, 2015 2:45 PM
> Subject: Spark Streaming S3 Performance Implications
> To: <user@spark.apache.org>
>
> Hi All,
>
> I am pushing data from a Kinesis stream to S3 using Spark Streaming and
> noticed during testing (i.e. master=local[2]) that the batches (1-second
> intervals) were falling behind the incoming data stream at about 5-10
> events/second. It seems that rdd.saveAsTextFile("s3n://...") is taking
> a few seconds to complete.
>
> val saveFunc = (rdd: RDD[String], time: Time) => {
>   val count = rdd.count()
>   if (count > 0) {
>     val s3BucketInterval = time.milliseconds.toString
>     rdd.saveAsTextFile("s3n://...")
>   }
> }
>
> dataStream.foreachRDD(saveFunc)
>
> Should I expect the same behaviour in a deployed cluster? Or does
> rdd.saveAsTextFile("s3n://...") distribute the push work to each worker
> node?
>
> "Write the elements of the dataset as a text file (or set of text files)
> in a given directory in the local filesystem, HDFS or any other
> Hadoop-supported file system. Spark will call toString on each element to
> convert it to a line of text in the file."
>
> Thanks, Mike.
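Putting Chris's advice together - one receiver per shard, a union of the shard streams, and writes at window rather than batch granularity - might look roughly like the sketch below. It is hedged, not a definitive implementation: the stream name, endpoint, bucket, shard count, and interval values are all placeholder assumptions, and the `KinesisUtils.createStream` signature shown is the one from the Spark Kinesis ASL package of that era (circa Spark 1.2/1.3).

```scala
// Sketch only - names and values below are illustrative placeholders,
// loosely following the bundled KinesisWordCountASL example.
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object KinesisToS3 {
  def main(args: Array[String]): Unit = {
    val batchInterval  = Seconds(1)   // illustrative batch interval
    val windowInterval = Seconds(10)  // a multiple of the batch interval, per Chris's advice
    val numShards      = 4            // assumption: match your Kinesis stream's shard count

    // In local mode, local[N] needs N >= numShards + 1:
    // one thread per receiver, plus at least one for processing.
    val conf = new SparkConf().setAppName("KinesisToS3")
    val ssc  = new StreamingContext(conf, batchInterval)

    // One receiver per shard, then union them into a single DStream.
    val shardStreams = (0 until numShards).map { _ =>
      KinesisUtils.createStream(
        ssc, "my-stream",                                  // placeholder stream name
        "https://kinesis.us-east-1.amazonaws.com",         // placeholder endpoint
        batchInterval, InitialPositionInStream.LATEST,
        StorageLevel.MEMORY_AND_DISK_2)
    }
    val unioned = ssc.union(shardStreams).map(bytes => new String(bytes))

    // Non-overlapping windows: write once per window instead of once per
    // batch, amortizing the per-write S3 overhead Mike observed.
    unioned.window(windowInterval, windowInterval).foreachRDD { (rdd, time) =>
      if (rdd.count() > 0) {
        rdd.saveAsTextFile(s"s3n://my-bucket/${time.milliseconds}")  // placeholder bucket
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

On a cluster, `saveAsTextFile` is itself distributed - each partition is written by the worker that holds it - so the win here comes from fewer, larger writes per window, not from parallelizing a single write.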