I don't think anyone has really run 500 text file streams. And the `par` sequences do nothing there: you are only parallelizing the setup code, which does not really compute anything. It also sets up 500 foreachRDD operations that will get executed sequentially in each batch, so it does not make sense. The right way to parallelize this is to union all the streams:
val streams = streamPaths.map { path => ssc.textFileStream(path) }
val unionedStream = ssc.union(streams)
unionedStream.foreachRDD { rdd =>
  // do something
}

Then there is only one foreachRDD executed in every batch, and it will process in parallel all the new files found in each batch interval.

TD

On Tue, Jul 28, 2015 at 3:06 PM, Brandon White <bwwintheho...@gmail.com> wrote:

> val ssc = new StreamingContext(sc, Minutes(10))
>
> //500 textFile streams watching S3 directories
> val streams = streamPaths.par.map { path =>
>   ssc.textFileStream(path)
> }
>
> streams.par.foreach { stream =>
>   stream.foreachRDD { rdd =>
>     //do something
>   }
> }
>
> ssc.start()
>
> Would something like this scale? What would be the limiting factor to
> performance? What is the best way to parallelize this? Any other ideas on
> design?
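For completeness, here is a minimal self-contained sketch of the union approach from the reply above. The S3 paths, batch interval, and processing body are illustrative placeholders, not from the original thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Minutes, StreamingContext}

object UnionedFileStreams {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UnionedFileStreams")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Minutes(10))

    // Hypothetical list of S3 directories to watch; in the original
    // question there were ~500 of these.
    val streamPaths: Seq[String] = Seq(
      "s3n://bucket/dir1",
      "s3n://bucket/dir2"
    )

    // Plain .map is enough here: stream setup does no real computation,
    // so parallelizing it with .par buys nothing.
    val streams = streamPaths.map(path => ssc.textFileStream(path))

    // Union into a single DStream so each batch runs a single
    // foreachRDD job over all newly arrived files.
    val unionedStream = ssc.union(streams)

    unionedStream.foreachRDD { rdd =>
      // Placeholder processing: the RDD's partitions (one or more per
      // new file) are processed in parallel across the cluster.
      println(s"Records in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The key design point, per the reply, is that unioning collapses 500 sequentially scheduled foreachRDD outputs into one output operation per batch, letting Spark parallelize across all the new files at the RDD-partition level instead.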