Re: Spark Streaming S3 Performance Implications

Chris Fregly Sat, 21 Mar 2015 08:10:50 -0700

hey mike!
you'll definitely want to increase your parallelism by adding more shards to 
the stream - as well as spinning up 1 receiver per shard and unioning all the 
shards per the KinesisWordCount example that is included with the kinesis 
streaming package. 
you'll need more cores (cluster) or threads (local) to support this - equalling 
at least the number of shards/receivers + 1.
also, it looks like you're writing to S3 per RDD.  you'll want to broaden that 
out to write DStream batches - or expand  even further and write window batches 
(where the window interval is a multiple of the batch interval).
this goes for any spark streaming implementation - not just Kinesis.
lemme know if that works for you.
thanks!
-Chris 
    _____________________________
From: Mike Trienis <mike.trie...@orcsol.com>
Sent: Wednesday, March 18, 2015 2:45 PM
Subject: Spark Streaming S3 Performance Implications
To:  <user@spark.apache.org>



       Hi All,       
          I am pushing data from Kinesis stream to S3 using Spark Streaming and 
noticed that during testing (i.e. master=local[2]) the batches (1 second 
intervals) were falling behind the incoming data stream at about 5-10 events / 
second. It seems that the rdd.saveAsTextFile(s3n://...) is taking at a few 
seconds to complete.           
                       val saveFunc = (rdd: RDD[String], time: Time) => {       
      
                         val count = rdd.count()             
                         if (count > 0) {             
                             val s3BucketInterval = time.milliseconds.toString  
           
                            rdd.saveAsTextFile(s3n://...)                       
                }                     }             
                     dataStream.foreachRDD(saveFunc)              
          
          Should I expect the same behaviour in a deployed cluster? Or does the 
rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?     
     
          "Write the elements of the dataset as a text file (or set of text 
files) in a given directory in the local filesystem, HDFS or any other 
Hadoop-supported file system. Spark will call toString on each element to 
convert it to a line of text in the file."          
          Thanks, Mike.

Re: Spark Streaming S3 Performance Implications

Reply via email to