Hey Mike - great to see you're using the AWS stack to its fullest!
I've already created the Kinesis-Spark Streaming connector with examples, documentation, tests, and everything. You'll need to build Spark from source with the -Pkinesis-asl profile, otherwise the Kinesis classes won't be included in the build. This is due to the Amazon Software License (ASL).

Here's the link to the Kinesis-Spark Streaming integration guide:
- http://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html

Here's a link to the source:
- https://github.com/apache/spark/tree/master/extras/kinesis-asl/src/main/scala/org/apache/spark

Here's the original jira, as well as some follow-up jiras that I'm working on now:
- https://issues.apache.org/jira/browse/SPARK-1981
- https://issues.apache.org/jira/browse/SPARK-5959
- https://issues.apache.org/jira/browse/SPARK-5960

I met a lot of folks at the Strata San Jose conference a couple of weeks ago who are using this Kinesis connector with good results, so you should be OK there.

Regarding Redshift: the current Redshift connector only pulls data from Redshift - it doesn't write to Redshift. And actually, it only reads files that have been UNLOADed from Redshift into S3, so it's not even pulling from Redshift directly. This is confusing, I know. Here's the jira for this work:
- https://issues.apache.org/jira/browse/SPARK-3205

For now, you'd have to use the AWS Java SDK to write to Redshift directly from within your Spark Streaming workers as batches of data come in from the Kinesis stream. This is roughly how your Spark Streaming app would look:

    dstream.foreachRDD { rdd =>
      // everything in here runs on the Driver
      rdd.foreachPartition { partitionOfRecords =>
        // everything in here runs on a Worker and operates on one partition of records

        // RedshiftConnectionPool is a static, lazily-initialized singleton pool of
        // connections that lives within the Worker JVM

        // retrieve a connection from the pool
        val connection = RedshiftConnectionPool.getConnection()

        // the application logic goes here - parse each record and write it to
        // Redshift using the connection
        partitionOfRecords.foreach(record => connection.send(record))

        // return the connection to the pool for future reuse
        RedshiftConnectionPool.returnConnection(connection)
      }
    }

Note that you would need to write the RedshiftConnectionPool singleton class yourself, as I mentioned - there's a rough sketch of one below.

There is also a relatively new Spark SQL DataSources API that supports reading from and writing to external data sources like these. An example Avro implementation is here: http://spark-packages.org/package/databricks/spark-avro. I think that's a read-only impl, but you get the idea. I've created two jiras for DynamoDB and Redshift libraries that implement this DataSources API:
- https://issues.apache.org/jira/browse/SPARK-6101 (DynamoDB)
- https://issues.apache.org/jira/browse/SPARK-6102 (Redshift)

Perhaps you can take a stab at SPARK-6102? I've put a skeleton of the write-side hook below as well, to get you started. I should be done with the DynamoDB impl in the next few weeks - it's scheduled for Spark 1.4.0.
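Here's a rough, untested sketch of the kind of connection pool singleton I mean. For the actual SQL writes it assumes plain JDBC rather than the SDK (Redshift speaks the PostgreSQL wire protocol), so the pool just hands out java.sql.Connections - and the jdbcUrl/user/password values are placeholders for your own cluster:

    import java.sql.{Connection, DriverManager}
    import java.util.concurrent.ConcurrentLinkedQueue

    // a static, lazily-initialized singleton pool of JDBC connections -
    // Scala objects are initialized once per JVM, i.e. once per Worker
    object RedshiftConnectionPool {
      // placeholder connection settings - point these at your own cluster
      private val jdbcUrl  = "jdbc:postgresql://your-cluster.redshift.amazonaws.com:5439/yourdb"
      private val user     = "youruser"
      private val password = "yourpassword"

      // idle connections available for reuse within this Worker JVM
      private val pool = new ConcurrentLinkedQueue[Connection]()

      // hand out an idle connection, creating a new one if the pool is empty
      def getConnection(): Connection = {
        val conn = pool.poll()
        if (conn != null && !conn.isClosed) conn
        else DriverManager.getConnection(jdbcUrl, user, password)
      }

      // return a connection to the pool for future reuse
      def returnConnection(conn: Connection): Unit = {
        pool.offer(conn)
      }
    }

With this version, the connection.send(record) pseudocode above becomes a PreparedStatement INSERT on the pooled connection. (For high volume, the more Redshift-friendly pattern is to batch records to S3 and COPY them in, but one step at a time.)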
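And if you do pick up SPARK-6102, here's a rough, untested skeleton of the write-side hook in the 1.3 DataSources API (CreatableRelationProvider). The trait names are the real API; DefaultSource, RedshiftRelation, and writeRow are just placeholder names to show the shape of an implementation:

    import java.sql.Connection
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row, SQLContext, SaveMode}
    import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, TableScan}
    import org.apache.spark.sql.types.StructType

    // the entry point Spark SQL looks up when this package is named as a data source
    class DefaultSource extends CreatableRelationProvider with Serializable {
      // called on save() - this is where the write to Redshift would go
      // (SaveMode handling - append vs. overwrite etc. - omitted here)
      override def createRelation(
          sqlContext: SQLContext,
          mode: SaveMode,
          parameters: Map[String, String],
          data: DataFrame): BaseRelation = {
        data.foreachPartition { rows =>
          // one pooled connection per partition, same pattern as above
          val conn = RedshiftConnectionPool.getConnection()
          try {
            rows.foreach(row => writeRow(conn, row))
          } finally {
            RedshiftConnectionPool.returnConnection(conn)
          }
        }
        new RedshiftRelation(data.schema)(sqlContext)
      }

      // placeholder - the real INSERT (or S3 + COPY) logic would live here
      private def writeRow(conn: Connection, row: Row): Unit = {
      }
    }

    // minimal read-side stub so the relation compiles - SPARK-3205 covers the real scan
    class RedshiftRelation(override val schema: StructType)
                          (@transient override val sqlContext: SQLContext)
      extends BaseRelation with TableScan {
      override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
    }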
Hope that helps! :)

-Chris

On Sun, Mar 1, 2015 at 11:06 AM, Mike Trienis <mike.trie...@orcsol.com> wrote:

> Hi All,
>
> I am looking at integrating a data stream from AWS Kinesis to AWS Redshift,
> and since I am already ingesting the data through Spark Streaming, it seems
> convenient to also push that data to AWS Redshift at the same time.
>
> I have taken a look at the AWS Kinesis connector, although I am not sure it
> was designed to integrate with Apache Spark. It seems more like a
> standalone approach:
>
> - https://github.com/awslabs/amazon-kinesis-connectors
>
> There is also a Spark-Redshift integration library; however, it looks like
> it was intended for pulling data rather than pushing data to AWS Redshift:
>
> - https://github.com/databricks/spark-redshift
>
> I finally took a look at a general Scala library that integrates with AWS
> Redshift:
>
> - http://scalikejdbc.org/
>
> Does anyone have any experience pushing data from Spark Streaming to AWS
> Redshift? Does it make sense conceptually, or does it make more sense to
> push data from AWS Kinesis to AWS Redshift via another standalone approach
> such as the AWS Kinesis connectors?
>
> Thanks, Mike.