Re: Spark Streaming S3 Performance Implications

2015-04-01 Thread Mike Trienis
Hey Chris,

Apologies for the delayed reply. Your responses are always insightful and
appreciated :-)

However, I have a few more questions.

"also, it looks like you're writing to S3 per RDD.  you'll want to broaden
that out to write DStream batches"

I assume you mean "dstream.saveAsTextFiles()" vs
"rdd.saveAsTextFile()". Although, looking at the source code in
DStream.scala, saveAsTextFiles simply wraps rdd.saveAsTextFile:

  def saveAsTextFiles(prefix: String, suffix: String = "") {
    val saveFunc = (rdd: RDD[T], time: Time) => {
      val file = rddToFileName(prefix, suffix, time)
      rdd.saveAsTextFile(file)
    }
    this.foreachRDD(saveFunc)
  }

So it's not clear to me how it would improve the throughput.

Also, regarding your comment "expand even further and write window batches
(where the window interval is a multiple of the batch interval)": I still
don't quite understand the mechanics underneath. For example, what would be
the difference between extending the batch interval and adding windowed
batches? I presume it has something to do with the processor thread(s)
within a receiver?
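
To make the question concrete, here is a minimal sketch of the two options as
I understand them (the 30-second values, conf, and the elided s3n prefix are
placeholder assumptions on my part, not details from this thread):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Option A: extend the batch interval itself, so every operation in the
  // job now runs over 30 seconds of accumulated data.
  val ssc = new StreamingContext(conf, Seconds(30))

  // Option B: keep the 1-second batch interval but write to S3 on a
  // 30-second window, so each save covers 30 batches' worth of records
  // while the rest of the pipeline still runs every second.
  val windowed = dataStream.window(Seconds(30), Seconds(30))
  windowed.foreachRDD { rdd =>
    if (rdd.count() > 0) rdd.saveAsTextFile("s3n://...") // path elided
  }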

By the way, in the near future, I'll be putting together some performance
numbers on a proper deployment, and will be sure to share my findings.

Thanks, Mike!





Re: Spark Streaming S3 Performance Implications

2015-03-21 Thread Ted Yu
Mike:
Once Hadoop 2.7.0 is released, you should be able to enjoy the enhanced
performance of s3a.
See HADOOP-11571
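
As a rough sketch, the switch might look like this once hadoop-aws 2.7.x and
the matching aws-java-sdk jar are on the classpath (the property names are the
standard s3a keys; credentials and bucket are placeholders):

  // configure s3a credentials on the SparkContext's Hadoop configuration
  sc.hadoopConfiguration.set("fs.s3a.access.key", "...") // elided
  sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")
  // then write with an s3a:// URI instead of s3n://
  rdd.saveAsTextFile("s3a://your-bucket/prefix") // placeholder bucket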

Cheers



Re: Spark Streaming S3 Performance Implications

2015-03-21 Thread Chris Fregly
hey mike!
you'll definitely want to increase your parallelism by adding more shards to 
the stream - as well as spinning up 1 receiver per shard and unioning all the 
shards per the KinesisWordCount example that is included with the kinesis 
streaming package. 
you'll need more cores (cluster) or threads (local) to support this - equalling 
at least the number of shards/receivers + 1.
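for example, a rough sketch along these lines, assuming an existing
StreamingContext ssc (stream name, endpoint, region, checkpoint interval, and
shard count are placeholders - see the KinesisWordCount example for the exact
setup):

  import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.Seconds
  import org.apache.spark.streaming.kinesis.KinesisUtils

  val numShards = 2 // placeholder: match your stream's shard count
  // one receiver per shard - each occupies a core (cluster) or thread (local)
  val kinesisStreams = (0 until numShards).map { _ =>
    KinesisUtils.createStream(ssc, "myStream",
      "https://kinesis.us-east-1.amazonaws.com", Seconds(2),
      InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
  }
  // union the per-shard streams into a single DStream before processing
  val unioned = ssc.union(kinesisStreams)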
also, it looks like you're writing to S3 per RDD.  you'll want to broaden that
out to write DStream batches - or expand even further and write window batches
(where the window interval is a multiple of the batch interval).
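for example, a rough sketch (the 1-minute window and the elided output prefix
are placeholders):

  import org.apache.spark.streaming.Minutes

  // write once per 1-minute window instead of once per 1-second batch,
  // producing fewer, larger s3 objects per write
  dataStream.window(Minutes(1), Minutes(1)).saveAsTextFiles("s3n://...")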
this goes for any spark streaming implementation - not just Kinesis.
lemme know if that works for you.
thanks!
-Chris 
_
From: Mike Trienis 
Sent: Wednesday, March 18, 2015 2:45 PM
Subject: Spark Streaming S3 Performance Implications
To:  


Hi All,

I am pushing data from a Kinesis stream to S3 using Spark Streaming and
noticed that during testing (i.e. master=local[2]) the batches (1-second
intervals) were falling behind the incoming data stream at about 5-10
events / second. It seems that rdd.saveAsTextFile(s3n://...) is taking a
few seconds to complete.
  val saveFunc = (rdd: RDD[String], time: Time) => {
    val count = rdd.count()
    if (count > 0) {
      val s3BucketInterval = time.milliseconds.toString
      rdd.saveAsTextFile("s3n://...") // bucket path elided
    }
  }

  dataStream.foreachRDD(saveFunc)
  
  Should I expect the same behaviour in a deployed cluster? Or does the 
rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node? 
 
  "Write the elements of the dataset as a text file (or set of text 
files) in a given directory in the local filesystem, HDFS or any other 
Hadoop-supported file system. Spark will call toString on each element to 
convert it to a line of text in the file."  
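
If it does distribute the work, I assume each partition would become one
part-file written by its own task, so controlling the number of S3 objects
would look something like this sketch (the count of 4 is an arbitrary
placeholder on my part):

  // coalesce to control how many part-files get pushed to S3 in parallel
  rdd.coalesce(4).saveAsTextFile("s3n://...")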
  Thanks, Mike.