@Das, is there any way to identify the Kafka topic when we have a unified stream? As of now, I create a dedicated DStream for each topic and use foreachRDD on each of these streams. If I have, say, 100 Kafka topics, how can I use a unified stream and still take topic-specific actions inside foreachRDD?
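For context, the pattern I have been looking at is roughly the following — a sketch only, assuming the Spark 1.3+ direct Kafka API from spark-streaming-kafka, with `ssc`, `kafkaParams`, and `topics` defined elsewhere. It relies on the documented fact that each partition of a direct-stream RDD corresponds one-to-one to an entry in its `offsetRanges`, so the topic name can be recovered per record:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// One unified direct stream over all topics.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // Each RDD partition maps 1:1 to an OffsetRange, which carries the topic name.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { (i, iter) =>
    val topic = offsetRanges(i).topic
    iter.map { case (_, payload) => (topic, payload) }
  }.foreach { case (topic, payload) =>
    // topic-specific action here
  }
}
```

Whether something like this scales to 100 topics in a single stream is exactly what I am unsure about.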
On Tue, Jul 28, 2015 at 6:42 PM, Tathagata Das <t...@databricks.com> wrote:
> I don't think anyone has really run 500 text streams. And the par
> sequences do nothing there: you are only parallelizing the setup code,
> which does not really compute anything. It also sets up 500 foreachRDD
> operations that get executed sequentially in each batch, so that does
> not make sense. The right way to parallelize this is to union all the
> streams:
>
> val streams = streamPaths.map { path =>
>   ssc.textFileStream(path)
> }
> val unionedStream = ssc.union(streams)
> unionedStream.foreachRDD { rdd =>
>   // do something
> }
>
> Then there is only one foreachRDD executed in every batch, and it will
> process all the new files in each batch interval in parallel.
>
> TD
>
> On Tue, Jul 28, 2015 at 3:06 PM, Brandon White <bwwintheho...@gmail.com> wrote:
>> val ssc = new StreamingContext(sc, Minutes(10))
>>
>> // 500 textFile streams watching S3 directories
>> val streams = streamPaths.par.map { path =>
>>   ssc.textFileStream(path)
>> }
>>
>> streams.par.foreach { stream =>
>>   stream.foreachRDD { rdd =>
>>     // do something
>>   }
>> }
>>
>> ssc.start()
>>
>> Would something like this scale? What would be the limiting factor to
>> performance? What is the best way to parallelize this? Any other ideas
>> on design?

--
Thanks & Regards,
Ashwin Giridharan