Re: Spark Streaming S3 Performance Implications

2015-04-01 Thread Mike Trienis
Hey Chris,

Apologies for the delayed reply. Your responses are always insightful and
appreciated :-)

However, I have a few more questions.

also, it looks like you're writing to S3 per RDD.  you'll want to broaden
that out to write DStream batches

I assume you mean dstream.saveAsTextFiles() vs. rdd.saveAsTextFile(). Although,
looking at the source code in DStream.scala, saveAsTextFiles simply wraps
rdd.saveAsTextFile:

  def saveAsTextFiles(prefix: String, suffix: String = "") {
    val saveFunc = (rdd: RDD[T], time: Time) => {
      val file = rddToFileName(prefix, suffix, time)
      rdd.saveAsTextFile(file)
    }
    this.foreachRDD(saveFunc)
  }

So it's not clear to me how it would improve the throughput.

Also, regarding your comment to "expand even further and write window batches
(where the window interval is a multiple of the batch interval)": I still don't
quite understand the mechanics underneath. For example, what would be the
difference between extending the batch interval and adding windowed batches? I
presume it has something to do with the processor thread(s) within a receiver?
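To make sure I understand, here is roughly what I think you are suggesting (a
sketch only; the 60-second window and bucket path are made-up values, and I
kept the count() > 0 guard from my original code):

```scala
// Coalesce 60 one-second batches into a single windowed RDD so that each
// S3 write produces fewer, larger objects (sketch; the window and slide
// durations are illustrative and must be multiples of the batch interval).
val windowed = dataStream.window(Seconds(60), Seconds(60))

windowed.foreachRDD { (rdd: RDD[String], time: Time) =>
  if (rdd.count() > 0) {
    rdd.saveAsTextFile("s3n://my-bucket/" + time.milliseconds.toString)
  }
}
```

Is that the right idea, i.e. fewer but larger writes per window rather than
one write per batch?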

By the way, in the near future, I'll be putting together some performance
numbers on a proper deployment, and will be sure to share my findings.

Thanks, Mike!








Re: Spark Streaming S3 Performance Implications

2015-03-21 Thread Ted Yu
Mike:
Once Hadoop 2.7.0 is released, you should be able to enjoy the enhanced
performance of s3a.
See HADOOP-11571

Cheers






Re: Spark Streaming S3 Performance Implications

2015-03-21 Thread Chris Fregly
hey mike!
you'll definitely want to increase your parallelism by adding more shards to 
the stream - as well as spinning up 1 receiver per shard and unioning all the 
shards per the KinesisWordCount example that is included with the kinesis 
streaming package. 
you'll need more cores (cluster) or threads (local) to support this - equalling 
at least the number of shards/receivers + 1.
also, it looks like you're writing to S3 per RDD.  you'll want to broaden that 
out to write DStream batches - or expand  even further and write window batches 
(where the window interval is a multiple of the batch interval).
this goes for any spark streaming implementation - not just Kinesis.
lemme know if that works for you.
thanks!
-Chris 
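
to make it concrete, here's a rough sketch following the KinesisWordCount
example - the shard count, stream name, and endpoint are placeholders, and the
createStream signature below is the one from the 1.3-era kinesis-asl package:

```scala
// one receiver per shard, then union them into a single DStream (sketch)
val numStreams = 4  // placeholder: use your stream's actual shard count
val kinesisStreams = (0 until numStreams).map { i =>
  KinesisUtils.createStream(ssc, streamName, endpointUrl,
    Seconds(1), InitialPositionInStream.LATEST,
    StorageLevel.MEMORY_AND_DISK_2)
}
val unionStream = ssc.union(kinesisStreams)
```

remember: local[N] needs N >= numStreams + 1 so there's at least one thread
left over for processing.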

Spark Streaming S3 Performance Implications

2015-03-18 Thread Mike Trienis
Hi All,

I am pushing data from Kinesis stream to S3 using Spark Streaming and
noticed that during testing (i.e. master=local[2]) the batches (1 second
intervals) were falling behind the incoming data stream at about 5-10
events / second. It seems that rdd.saveAsTextFile("s3n://...") is taking a few
seconds to complete.

    val saveFunc = (rdd: RDD[String], time: Time) => {

      val count = rdd.count()

      if (count > 0) {

        val s3BucketInterval = time.milliseconds.toString

        rdd.saveAsTextFile("s3n://...")

      }
    }

    dataStream.foreachRDD(saveFunc)


Should I expect the same behaviour in a deployed cluster? Or does the
rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?

Write the elements of the dataset as a text file (or set of text files) in
a given directory in the local filesystem, HDFS or any other
Hadoop-supported file system. Spark will call toString on each element to
convert it to a line of text in the file.

Thanks, Mike.