I wanted to get some perspective on how to share code between Spark batch processing and Spark Streaming.
For example, I want to compute the set of unique tweets stored in an HDFS file in both Spark batch and Spark Streaming. Currently I have to do the following:

    case class Tweet(tweetText: String, userId: String)

Spark batch:

    val tweets = sparkContext.newAPIHadoopFile("tweet")

    def getUniqueTweets(tweets: RDD[Tweet]): RDD[String] =
      tweets.map(tweet => (tweet.tweetText, tweet))
            .groupByKey()
            .map { case (tweetText, _) => tweetText }

Spark Streaming:

    val tweets = streamingContext.fileStream("tweet")

    def getUniqueTweets(tweets: DStream[Tweet]): DStream[String] =
      tweets.map(tweet => (tweet.tweetText, tweet))
            .groupByKey()
            .map { case (tweetText, _) => tweetText }

As the example shows, I am doing the same thing in both cases, but I have to duplicate the code because there is no common abstraction between DStream and RDD, or between SparkContext and StreamingContext. With a common abstraction it would be much simpler:

    val tweets = context.read("tweet", Stream or Batch)

    def getUniqueTweets(tweets: SparkObject[Tweet]): SparkObject[String] =
      tweets.map(tweet => (tweet.tweetText, tweet))
            .groupByKey()
            .map { case (tweetText, _) => tweetText }

I would appreciate thoughts on this. Is it already available? Is there a plan to add this support? Or is it intentionally not supported as a design choice?
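For context, the closest I have come to sharing the logic today is DStream.transform, which applies an RDD-to-RDD function to each micro-batch, so the core transformation can be written once against RDD. A rough sketch (batchTweets and tweetStream are placeholder inputs, not real variables from my code):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    case class Tweet(tweetText: String, userId: String)

    // Shared logic, written once against RDD
    def getUniqueTweets(tweets: RDD[Tweet]): RDD[String] =
      tweets.map(_.tweetText).distinct()

    // Batch: apply the function to the RDD directly
    val uniqueBatch: RDD[String] = getUniqueTweets(batchTweets)

    // Streaming: lift the same RDD-to-RDD function onto each
    // micro-batch of the DStream via transform
    val uniqueStream: DStream[String] = tweetStream.transform(getUniqueTweets _)

This covers the transformations, but the input sources (newAPIHadoopFile vs. fileStream) still have to be wired up separately, which is why a unified read abstraction would help.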