bq. streamingContext.remember("duration") did not help Can you give a bit more detail on the above ? Did you mean the job encountered OOME later on ?
Which Spark release are you using?

Cheers

On Wed, Feb 17, 2016 at 6:03 PM, ramach1776 <ram...@s1776.com> wrote:

> We have a streaming application containing approximately 12 jobs every
> batch, running in streaming mode (4 sec batches). Each job has several
> transformations and 1 action (output to Cassandra) which triggers
> execution of the job (DAG).
>
> For example, the first job:
>
> /job 1
> ---> receive Stream A --> map --> filter --> (union with another stream B)
> --> map -->/ groupByKey --> transform --> reduceByKey --> map
>
> Likewise we go through a few more transforms and save to the database
> (job 2, job 3, ...).
>
> Recently we added a new transformation further downstream wherein we
> union the output of the DStream from job 1 (in italics) with the output
> from a new transformation (job 5). It appears the whole execution up to
> that point is repeated, which is redundant (I can see this in the
> execution graph and also in performance -> processing time).
>
> That is, with this additional transformation (union with a stream
> processed upstream) each batch runs as much as 2.5 times slower compared
> to runs without the union. If I cache the DStream from job 1 (italics),
> performance improves substantially, but we hit out-of-memory errors
> within a few hours.
>
> What is the recommended way to cache/unpersist in such a scenario? There
> is no DStream-level "unpersist". Setting "spark.streaming.unpersist" to
> true and streamingContext.remember("duration") did not help.
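For what it's worth, here is a rough sketch of persisting the shared job-1 DStream so the downstream union reuses its RDDs instead of recomputing the whole lineage each batch. This assumes the Scala DStream API; the sources, the parsing, the job-5 stage, and the print() calls (standing in for your Cassandra writes) are all invented for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object UnionCacheSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("union-cache-sketch"), Seconds(4))

        // Hypothetical sources standing in for Stream A and Stream B.
        val streamA = ssc.socketTextStream("hostA", 9999)
        val streamB = ssc.socketTextStream("hostB", 9999)

        // Job 1's lineage as described: map -> filter -> union -> map ->
        // groupByKey -> transform -> reduceByKey -> map (parsing is made up).
        val job1Output = streamA
          .map(line => (line.split(",")(0), 1))
          .filter(_._1.nonEmpty)
          .union(streamB.map(line => (line, 1)))
          .map(identity)
          .groupByKey()
          .transform(rdd => rdd.mapValues(_.sum))
          .reduceByKey(_ + _)
          .map { case (k, v) => (k, v) } // stands in for the final projection

        // Persist the shared stage once. A serialized, disk-spilling level
        // has a smaller memory footprint than the MEMORY_ONLY_SER that plain
        // cache() uses for DStreams, which may help avoid the OOME you hit.
        job1Output.persist(StorageLevel.MEMORY_AND_DISK_SER)

        job1Output.print() // stand-in for the job-1 Cassandra write

        // Hypothetical job-5 stage that unions with the persisted stream;
        // with persist() above, this reuses job 1's RDDs per batch instead
        // of re-running the lineage.
        val job5 = job1Output.map { case (k, v) => (k, v * 2) }
        job1Output.union(job5).print() // stand-in for the downstream write

        ssc.start()
        ssc.awaitTermination()
      }
    }

You are right that there is no public DStream-level unpersist. With spark.streaming.unpersist left at its default of true and a modest remember() window, Spark should release the persisted RDDs once no pending batch still references them.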