Hey Andy,

That's pretty cool! Is there a GitHub repo where you can share this piece
of code for us to play around with? If we can come up with a simple enough
general pattern, it could be very useful!

TD


On Fri, Jul 11, 2014 at 4:12 PM, andy petrella <andy.petre...@gmail.com>
wrote:

> A while ago, I wrote this:
> ```
> package com.virdata.core.compute.common.api
>
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import org.apache.spark.streaming.StreamingContext
> import org.apache.spark.streaming.dstream.DStream
>
> // Abstracts over the execution environment: the context type and the
> // distributed-collection type ("Wagon") it carries.
> sealed trait SparkEnvironment extends Serializable {
>   type Context
>   type Wagon[A]
> }
>
> object Batch extends SparkEnvironment {
>   type Context = SparkContext
>   type Wagon[A] = RDD[A]
> }
>
> object Streaming extends SparkEnvironment {
>   type Context = StreamingContext
>   type Wagon[A] = DStream[A]
> }
> ```
> Then I can produce code like this (just an example):
>
> ```
> package com.virdata.core.compute.common.api
>
> import org.apache.spark.Logging
>
> // One processing step, abstracted over the environment E, so the same
> // pipeline definition works for both Batch and Streaming.
> trait Process[M[_], In, N[_], Out, E <: SparkEnvironment] extends Logging { self =>
>
>   def run(in: M[E#Wagon[In]])(implicit context: E#Context): N[E#Wagon[Out]]
>
>   // Sequential composition: this step, then `follow`.
>   def pipe[Q[_], U](follow: Process[N, Out, Q, U, E]): Process[M, In, Q, U, E] =
>     new Process[M, In, Q, U, E] {
>       override def run(in: M[E#Wagon[In]])(implicit context: E#Context): Q[E#Wagon[U]] = {
>         val run1: N[E#Wagon[Out]] = self.run(in)
>         follow.run(run1)
>       }
>     }
> }
> ```
>
> It doesn't solve the whole problem, because we still have to duplicate the
> implementations (one for Batch, one for Streaming).
> However, when the common traits land, I'll only have to remove half of the
> implementations -- without touching the calling side (the code using them),
> and thus keeping plain old backward compatibility ^^.
>
> I know it's just an intermediate hack, but still ;-)
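>
> To make the duplication concrete, the Batch flavour of a step could look
> like this (a rough sketch of mine, not from the actual codebase; the Id
> alias is only there to fill the M[_] and N[_] slots):
>
> ```
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
>
> object Example {
>   type Id[A] = A // trivial container for the M[_] / N[_] type slots
>
>   // Batch step: keep only the distinct tweet texts.
>   val uniqueTexts = new Process[Id, String, Id, String, Batch.type] {
>     override def run(in: RDD[String])(implicit sc: SparkContext): RDD[String] =
>       in.distinct()
>   }
> }
> ```
>
> The Streaming flavour is the same shape with Streaming.type, DStream and
> StreamingContext -- exactly the duplication the common traits would remove.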
>
> greetz,
>
>
>   aℕdy ℙetrella
> about.me/noootsab
>
>
> On Sat, Jul 12, 2014 at 12:57 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> I totally agree that if we are able to do this it will be very cool.
>> However, this requires having a common trait / interface between RDD and
>> DStream, which we don't have as of now. It would be very cool though. On
>> my wish list for sure.
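>>
>> Roughly, such a bridge could be a typeclass like the following (purely
>> illustrative sketch; nothing like this exists in Spark today, and the
>> Mappable name is made up):
>>
>> import scala.reflect.ClassTag
>> import org.apache.spark.rdd.RDD
>> import org.apache.spark.streaming.dstream.DStream
>>
>> // One instance per container; user code written against Mappable[C]
>> // would then work for both RDDs and DStreams.
>> trait Mappable[C[_]] {
>>   def map[A, B: ClassTag](ca: C[A])(f: A => B): C[B]
>> }
>>
>> object Mappable {
>>   implicit val forRdd: Mappable[RDD] = new Mappable[RDD] {
>>     def map[A, B: ClassTag](ca: RDD[A])(f: A => B): RDD[B] = ca.map(f)
>>   }
>>   implicit val forDStream: Mappable[DStream] = new Mappable[DStream] {
>>     def map[A, B: ClassTag](ca: DStream[A])(f: A => B): DStream[B] = ca.map(f)
>>   }
>> }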
>>
>> TD
>>
>>
>> On Thu, Jul 10, 2014 at 11:53 AM, mshah <shahmaul...@gmail.com> wrote:
>>
>>> I wanted to get a perspective on how to share code between Spark batch
>>> processing and Spark Streaming.
>>>
>>> For example, suppose I want to get the unique tweets stored in an HDFS
>>> file, in both Spark batch and Spark Streaming. Currently I have to do the
>>> following:
>>>
>>> case class Tweet(tweetText: String, userId: String)
>>>
>>> Spark Batch:
>>>
>>> val tweets = sparkContext.newAPIHadoopFile("tweet")  // plus key/value/InputFormat type parameters
>>>
>>> def getUniqueTweets(tweets: RDD[Tweet]) = {
>>>   tweets.map(tweet => (tweet.tweetText, tweet))
>>>         .groupByKey()
>>>         .map { case (tweetText, _) => tweetText }
>>> }
>>>
>>> Spark Streaming:
>>>
>>> val tweets = streamingContext.fileStream("tweet")  // plus type parameters
>>>
>>> def getUniqueTweets(tweets: DStream[Tweet]) = {
>>>   tweets.map(tweet => (tweet.tweetText, tweet))
>>>         .groupByKey()
>>>         .map { case (tweetText, _) => tweetText }
>>> }
>>>
>>>
>>> The above example shows I am doing the same thing in both cases, but I
>>> have to replicate the code because there is no common abstraction between
>>> DStream and RDD, or between SparkContext and StreamingContext.
>>>
>>> If there were a common abstraction it would be much simpler:
>>>
>>> tweets = context.read("tweet", Stream or Batch)
>>>
>>> def getUniqueTweets(tweets: SparkObject[Tweet]) = {
>>>   tweets.map(tweet => (tweet.tweetText, tweet))
>>>         .groupByKey()
>>>         .map { case (tweetText, _) => tweetText }
>>> }
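>>>
>>> The closest workaround I can think of today (an untested sketch on my
>>> part) is to write the logic once against RDD and lift it into streaming
>>> with DStream.transform:
>>>
>>> import org.apache.spark.SparkContext._   // pair RDD functions
>>> import org.apache.spark.rdd.RDD
>>> import org.apache.spark.streaming.dstream.DStream
>>>
>>> def getUniqueTweets(tweets: RDD[Tweet]): RDD[String] =
>>>   tweets.map(tweet => (tweet.tweetText, tweet))
>>>         .groupByKey()
>>>         .map { case (tweetText, _) => tweetText }
>>>
>>> // Reuses the batch function on each micro-batch, so uniqueness is per
>>> // batch interval rather than across the whole stream.
>>> def getUniqueTweetsStream(tweets: DStream[Tweet]): DStream[String] =
>>>   tweets.transform(getUniqueTweets _)
>>>
>>> But that still only covers the transformation, not the reading side.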
>>>
>>> I would appreciate thoughts on this. Is it already available? Is there
>>> any plan to add this support? Is it intentionally not supported because of
>>> a design choice?
>>>
>>>
>>
>>
>
