I wanted to get a perspective on how to share code between Spark batch
processing and Spark Streaming. 

For example, suppose tweets are stored in an HDFS file and I want to compute the unique tweets in both a Spark batch job and a Spark Streaming job. Currently I have to do something like the following:

case class Tweet(tweetText: String, userId: String)

Spark Batch:

val tweets = sparkContext.newAPIHadoopFile("tweet")

def getUniqueTweets(tweets: RDD[Tweet]): RDD[String] = {
     tweets.map(tweet => (tweet.tweetText, tweet))
       .groupByKey()
       .map { case (tweetText, _) => tweetText }
}

Spark Streaming: 

val tweets = streamingContext.fileStream("tweet")

def getUniqueTweets(tweets: DStream[Tweet]): DStream[String] = {
     tweets.map(tweet => (tweet.tweetText, tweet))
       .groupByKey()
       .map { case (tweetText, _) => tweetText }
}


The example above shows that I am doing the same thing in both cases, but I have to duplicate the code because there is no common abstraction between DStream and RDD, or between SparkContext and StreamingContext.
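
The closest thing I can see today is to keep the RDD-based getUniqueTweets above and lift it into the streaming job with DStream.transform, roughly like this (just a sketch of a workaround, not necessarily the recommended pattern):

import org.apache.spark.streaming.dstream.DStream

// Reuse the RDD-based getUniqueTweets on each micro-batch of the stream.
def getUniqueTweetsStream(tweets: DStream[Tweet]): DStream[String] =
     tweets.transform(rdd => getUniqueTweets(rdd))

But the input loading and the function signatures still differ between the two jobs, which is why I am asking about a more general abstraction.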

If there were a common abstraction, it would be much simpler:

tweets = context.read("tweet", Stream or Batch) 

def getUniqueTweets(tweets: SparkObject[Tweet]) = {
     tweets.map(tweet => (tweet.tweetText, tweet))
       .groupByKey()
       .map { case (tweetText, _) => tweetText }
}
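
To make the shape of what I mean a bit more concrete, here is a rough sketch of such an abstraction (all of these names are made up; nothing like this exists in Spark as far as I know):

// Hypothetical typeclass that both RDD and DStream could implement.
trait SparkCollection[F[_]] {
     def map[A, B](fa: F[A])(f: A => B): F[B]
     def distinct[A](fa: F[A]): F[A]
}

// The logic would then be written once against the abstraction:
def getUniqueTweets[F[_]](tweets: F[Tweet])(implicit c: SparkCollection[F]): F[String] =
     c.distinct(c.map(tweets)(_.tweetText))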

I would appreciate your thoughts on this. Is it already available? Is there any plan to add this support? Or is it intentionally not supported because of a design choice?




