Thanks everyone for the reply. Looks like foreachRDD + filtering is the way to go. I'll have 4 independent Spark streaming applications so the overhead seems acceptable.
Jianshi On Fri, Apr 17, 2015 at 5:17 PM, Evo Eftimov <evo.efti...@isecc.com> wrote: > Good use of analogies J > > > > Yep friction (or entropy in general) exists in everything – but hey by > adding and doing “more work” at the same time (aka more powerful rockets) > some people have overcome the friction of the air and even got as far as > the moon and beyond > > > > It is all about the bottom lime / the big picture – in some models, > friction can be a huge factor in the equations in some other it is just > part of the landscape > > > > *From:* Gerard Maas [mailto:gerard.m...@gmail.com] > *Sent:* Friday, April 17, 2015 10:12 AM > > *To:* Evo Eftimov > *Cc:* Tathagata Das; Jianshi Huang; user; Shao, Saisai; Huang Jie > *Subject:* Re: How to do dispatching in Streaming? > > > > Evo, > > > > In Spark there's a fixed scheduling cost for each task, so more tasks mean > an increased bottom line for the same amount of work being done. The number > of tasks per batch interval should relate to the CPU resources available > for the job following the same 'rule of thumbs' than for Spark, being 2-3 > times the #of cores. > > > > In that physical model presented before, I think we could consider this > scheduling cost as a form of friction. > > > > -kr, Gerard. > > > > On Thu, Apr 16, 2015 at 11:47 AM, Evo Eftimov <evo.efti...@isecc.com> > wrote: > > Ooops – what does “more work” mean in a Parallel Programming paradigm and > does it always translate in “inefficiency” > > > > Here are a few laws of physics in this space: > > > > 1. More Work if done AT THE SAME time AND fully utilizes the > cluster resources is a GOOD thing > > 2. More Work which can not be done at the same time and has to be > processed sequentially is a BAD thing > > > > So the key is whether it is about 1 or 2 and if it is about 1, whether it > leads to e.g. Higher Throughput and Lower Latency or not > > > > Regards, > > Evo Eftimov > > > > *From:* Gerard Maas [mailto:gerard.m...@gmail.com] > *Sent:* Thursday, April 16, 2015 10:41 AM > *To:* Evo Eftimov > *Cc:* Tathagata Das; Jianshi Huang; user; Shao, Saisai; Huang Jie > > > *Subject:* Re: How to do dispatching in Streaming? > > > > From experience, I'd recommend using the dstream.foreachRDD method and > doing the filtering within that context. Extending the example of TD, > something like this: > > > > dstream.foreachRDD { rdd => > > rdd.cache() > > messageType.foreach (msgTyp => > > val selection = rdd.filter(msgTyp.match(_)) > > selection.foreach { ... } > > } > > rdd.unpersist() > > } > > > > I would discourage the use of: > > MessageType1DStream = MainDStream.filter(message type1) > > MessageType2DStream = MainDStream.filter(message type2) > > MessageType3DStream = MainDStream.filter(message type3) > > > > Because it will be a lot more work to process on the spark side. > > Each DSteam will schedule tasks for each partition, resulting in #dstream > x #partitions x #stages tasks instead of the #partitions x #stages with the > approach presented above. > > > > > > -kr, Gerard. > > > > On Thu, Apr 16, 2015 at 10:57 AM, Evo Eftimov <evo.efti...@isecc.com> > wrote: > > And yet another way is to demultiplex at one point which will yield > separate DStreams for each message type which you can then process in > independent DAG pipelines in the following way: > > > > MessageType1DStream = MainDStream.filter(message type1) > > MessageType2DStream = MainDStream.filter(message type2) > > MessageType3DStream = MainDStream.filter(message type3) > > > > Then proceed your processing independently with MessageType1DStream, > MessageType2DStream and MessageType3DStream ie each of them is a starting > point of a new DAG pipeline running in parallel > > > > *From:* Tathagata Das [mailto:t...@databricks.com] > *Sent:* Thursday, April 16, 2015 12:52 AM > *To:* Jianshi Huang > *Cc:* user; Shao, Saisai; Huang Jie > *Subject:* Re: How to do dispatching in Streaming? > > > > It may be worthwhile to do architect the computation in a different way. > > > > dstream.foreachRDD { rdd => > > rdd.foreach { record => > > // do different things for each record based on filters > > } > > } > > > > TD > > > > On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang <jianshi.hu...@gmail.com> > wrote: > > Hi, > > > > I have a Kafka topic that contains dozens of different types of messages. > And for each one I'll need to create a DStream for it. > > > > Currently I have to filter the Kafka stream over and over, which is very > inefficient. > > > > So what's the best way to do dispatching in Spark Streaming? (one DStream > -> multiple DStreams) > > > > > Thanks, > > -- > > Jianshi Huang > > LinkedIn: jianshi > Twitter: @jshuang > Github & Blog: http://huangjs.github.com/ > > > > > > > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/