Thanks everyone for the replies.

Looks like foreachRDD + filtering is the way to go. I'll have 4 independent
Spark Streaming applications, so the overhead seems acceptable.
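
For the archives, here is roughly what I have in mind: a minimal sketch
assuming the Kafka direct API from Spark 1.3+; the broker, topic, type tags
and the prefix-based match rule are all placeholders, not my actual setup.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("dispatcher"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker:9092")   // placeholder broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mixed-topic"))                          // placeholder topic

val messageTypes = Seq("type1", "type2", "type3")                // placeholder type tags

stream.map(_._2).foreachRDD { rdd =>
  rdd.cache()                                        // traversed once per message type
  messageTypes.foreach { msgType =>
    val selection = rdd.filter(_.startsWith(msgType))            // placeholder match rule
    selection.foreach(record => println(record))                 // replace with the real sink
  }
  rdd.unpersist()
}

ssc.start()
ssc.awaitTermination()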

Jianshi


On Fri, Apr 17, 2015 at 5:17 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> Good use of analogies :)
>
>
>
> Yep, friction (or entropy in general) exists in everything – but hey, by
> adding and doing “more work” at the same time (aka more powerful rockets),
> some people have overcome the friction of the air and even got as far as
> the moon and beyond.
>
>
>
> It is all about the bottom line / the big picture – in some models,
> friction can be a huge factor in the equations; in others it is just
> part of the landscape.
>
>
>
> *From:* Gerard Maas [mailto:gerard.m...@gmail.com]
> *Sent:* Friday, April 17, 2015 10:12 AM
>
> *To:* Evo Eftimov
> *Cc:* Tathagata Das; Jianshi Huang; user; Shao, Saisai; Huang Jie
> *Subject:* Re: How to do dispatching in Streaming?
>
>
>
> Evo,
>
>
>
> In Spark there's a fixed scheduling cost for each task, so more tasks mean
> an increased bottom line for the same amount of work being done. The number
> of tasks per batch interval should relate to the CPU resources available
> for the job, following the same 'rule of thumb' as for batch Spark jobs:
> 2-3 times the # of cores.
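>
> As a rough sketch of that sizing rule (the core-count lookup and the 3x
> multiplier here are illustrative, not prescriptive):
>
> // Resize the stream so each batch yields ~2-3 tasks per available core.
> val totalCores = ssc.sparkContext.getConf.getInt("spark.cores.max", 8) // assumed setting
> val targetPartitions = totalCores * 3
> val resized = dstream.repartition(targetPartitions)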
>
>
>
> In that physical model presented before, I think we could consider this
> scheduling cost as a form of friction.
>
>
>
> -kr, Gerard.
>
>
>
> On Thu, Apr 16, 2015 at 11:47 AM, Evo Eftimov <evo.efti...@isecc.com>
> wrote:
>
> Ooops – what does “more work” mean in a Parallel Programming paradigm, and
> does it always translate into “inefficiency”?
>
>
>
> Here are a few laws of physics in this space:
>
>
>
> 1.       More Work, if it is done AT THE SAME time AND fully utilizes the
> cluster resources, is a GOOD thing
>
> 2.       More Work which cannot be done at the same time and has to be
> processed sequentially is a BAD thing
>
>
>
> So the key is whether it is about 1 or 2, and if it is about 1, whether it
> leads to e.g. Higher Throughput and Lower Latency or not
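>
> To make point 1 concrete in Spark terms, here is a sketch that submits the
> per-type filter jobs concurrently; it assumes spark.scheduler.mode=FAIR is
> enabled, and matches/process are hypothetical helpers:
>
> import scala.concurrent.{Await, Future}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
>
> dstream.foreachRDD { rdd =>
>   rdd.cache()
>   val jobs = messageTypes.map { msgType =>            // messageTypes: hypothetical list
>     Future { rdd.filter(matches(msgType, _)).foreach(process) }
>   }
>   jobs.foreach(Await.ready(_, Duration.Inf))          // finish all jobs before unpersist
>   rdd.unpersist()
> }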
>
>
>
> Regards,
>
> Evo Eftimov
>
>
>
> *From:* Gerard Maas [mailto:gerard.m...@gmail.com]
> *Sent:* Thursday, April 16, 2015 10:41 AM
> *To:* Evo Eftimov
> *Cc:* Tathagata Das; Jianshi Huang; user; Shao, Saisai; Huang Jie
>
>
> *Subject:* Re: How to do dispatching in Streaming?
>
>
>
> From experience, I'd recommend using the dstream.foreachRDD method and
> doing the filtering within that context. Extending TD's example,
> something like this:
>
>
>
> dstream.foreachRDD { rdd =>
>   rdd.cache()                              // the cached RDD is traversed once per type
>   messageTypes.foreach { msgType =>        // messageTypes: your list of type matchers
>     val selection = rdd.filter(msgType.matches(_))
>     selection.foreach { record => /* ... */ }
>   }
>   rdd.unpersist()                          // release the cache once all types are done
> }
>
>
>
> I would discourage the use of:
>
> MessageType1DStream = MainDStream.filter(message type1)
> MessageType2DStream = MainDStream.filter(message type2)
> MessageType3DStream = MainDStream.filter(message type3)
>
>
>
> Because it will be a lot more work to process on the Spark side.
>
> Each DStream will schedule tasks for each partition, resulting in #dstreams
> x #partitions x #stages tasks, instead of #partitions x #stages with the
> approach presented above.
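>
> To put made-up numbers on it: with 3 message types, 10 partitions and 2
> stages, the filtered-DStreams approach schedules 3 x 10 x 2 = 60 tasks per
> batch, versus 10 x 2 = 20 for the foreachRDD approach above.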
>
>
>
>
>
> -kr, Gerard.
>
>
>
> On Thu, Apr 16, 2015 at 10:57 AM, Evo Eftimov <evo.efti...@isecc.com>
> wrote:
>
> And yet another way is to demultiplex at one point, which will yield
> separate DStreams for each message type that you can then process in
> independent DAG pipelines, in the following way:
>
>
>
> MessageType1DStream = MainDStream.filter(message type1)
> MessageType2DStream = MainDStream.filter(message type2)
> MessageType3DStream = MainDStream.filter(message type3)
>
>
>
> Then proceed with your processing independently on MessageType1DStream,
> MessageType2DStream and MessageType3DStream, i.e. each of them is the
> starting point of a new DAG pipeline running in parallel.
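>
> In valid Scala that could look as follows (the msgType field and the type
> tags are hypothetical placeholders for whatever your records carry):
>
> case class Message(msgType: String, payload: String)   // assumed record shape
>
> val messageType1DStream = mainDStream.filter(_.msgType == "type1")
> val messageType2DStream = mainDStream.filter(_.msgType == "type2")
> val messageType3DStream = mainDStream.filter(_.msgType == "type3")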
>
>
>
> *From:* Tathagata Das [mailto:t...@databricks.com]
> *Sent:* Thursday, April 16, 2015 12:52 AM
> *To:* Jianshi Huang
> *Cc:* user; Shao, Saisai; Huang Jie
> *Subject:* Re: How to do dispatching in Streaming?
>
>
>
> It may be worthwhile to architect the computation in a different way.
>
>
>
> dstream.foreachRDD { rdd =>
>   rdd.foreach { record =>
>     // do different things for each record based on filters
>   }
> }
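>
> For instance, the per-record dispatch could be a simple pattern match (the
> msgType field and the handlers below are hypothetical placeholders):
>
> dstream.foreachRDD { rdd =>
>   rdd.foreach { record =>
>     record.msgType match {
>       case "type1" => handleType1(record)   // placeholder handlers
>       case "type2" => handleType2(record)
>       case _       => handleUnknown(record)
>     }
>   }
> }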
>
>
>
> TD
>
>
>
> On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang <jianshi.hu...@gmail.com>
> wrote:
>
> Hi,
>
>
>
> I have a Kafka topic that contains dozens of different types of messages,
> and for each type I'll need to create a separate DStream.
>
>
>
> Currently I have to filter the Kafka stream over and over, which is very
> inefficient.
>
>
>
> So what's the best way to do dispatching in Spark Streaming? (one DStream
> -> multiple DStreams)
>
>
>
>
> Thanks,
>
> --
>
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
>
>
>
>
>
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
