@Das, is there any way to identify the Kafka topic when we have a unified
stream? As of now, I create a dedicated DStream for each topic and call
foreachRDD on each of these streams. If I have, say, 100 Kafka topics,
how can I use a unified stream and still take topic-specific actions inside
foreachRDD?
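(For anyone hitting the same question: with the direct Kafka API added in Spark 1.3, the topic for each partition can be recovered from the RDD's offset ranges, since direct-stream RDD partitions map 1:1 to Kafka (topic, partition) ranges. A rough sketch, assuming the spark-streaming-kafka artifact is on the classpath; the broker address and topic names below are placeholders:)

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// One direct stream over all topics, instead of 100 dedicated DStreams.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
val topics = Set("topic1", "topic2") // ... up to 100 topic names

val unified = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

unified.foreachRDD { rdd =>
  // Each RDD partition of a direct stream corresponds to exactly one
  // Kafka (topic, partition) offset range.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val topic = ranges(TaskContext.get.partitionId).topic
    // topic-specific action on iter here
  }
}
```

This keeps a single foreachRDD per batch (as TD recommends below for the file-stream case) while still exposing which topic each partition came from.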

On Tue, Jul 28, 2015 at 6:42 PM, Tathagata Das <t...@databricks.com> wrote:

> I don't think anyone has really run 500 text streams.
> And the par sequences do nothing there; you are only parallelizing the
> setup code, which does not compute anything. It also sets up 500
> foreachRDD operations that get executed sequentially in each batch, so
> it does not make sense. The right way to parallelize this is to union all
> the streams.
>
> val streams = streamPaths.map { path =>
>   ssc.textFileStream(path)
> }
> val unionedStream = ssc.union(streams)
> unionedStream.foreachRDD { rdd =>
>   // do something
> }
>
> Then only one foreachRDD is executed per batch, and it processes all the
> new files from that batch interval in parallel.
> TD
>
>
> On Tue, Jul 28, 2015 at 3:06 PM, Brandon White <bwwintheho...@gmail.com>
> wrote:
>
>> val ssc = new StreamingContext(sc, Minutes(10))
>>
>> //500 textFile streams watching S3 directories
>> val streams = streamPaths.par.map { path =>
>>   ssc.textFileStream(path)
>> }
>>
>> streams.par.foreach { stream =>
>>   stream.foreachRDD { rdd =>
>>     //do something
>>   }
>> }
>>
>> ssc.start()
>>
>> Would something like this scale? What would be the limiting factor to
>> performance? What is the best way to parallelize this? Any other ideas on
>> design?
>>
>
>


-- 
Thanks & Regards,
Ashwin Giridharan
