Hi Haopu, please check these threads: http://stackoverflow.com/questions/24331815/spark-streaming-historical-state
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2016-06-13 3:11 GMT+02:00 Haopu Wang <hw...@qilinsoft.com>:

> Can someone look at my questions? Thanks again!
>
> ------------------------------
>
> *From:* Haopu Wang
> *Sent:* 12 June 2016 16:40
> *To:* user@spark.apache.org
> *Subject:* Should I avoid "state" in a Spark application?
>
> I have a Spark application whose structure is below:
>
>     var ts: Long = 0L
>
>     dstream1.foreachRDD { (x, time) =>
>       ts = time
>       x.do_something()...
>     }
>
>     ......
>
>     process_data(dstream2, ts, ......)
>
> I assume the foreachRDD function call can update the "ts" variable, which
> is then used in the Spark tasks of the "process_data" function.
>
> From my test results on a standalone Spark cluster, it is working. But
> should I be concerned if I switch to YARN?
>
> And I saw some articles recommending avoiding state in Scala programming.
> Without the state variable, how could this be done?
>
> Any comments or suggestions are appreciated.
>
> Thanks,
> Haopu
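One way to sidestep the shared "ts" var entirely is the `transform` overload on `DStream` that hands the batch `Time` to your function, so each batch reads its own timestamp locally instead of through driver-side mutable state. A minimal sketch along those lines (the element type `String` and the tagging logic are placeholders, not from the original code; since both DStreams come from the same `StreamingContext`, their batches carry the same batch time):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// Sketch: avoid a driver-side var by taking the batch time from transform().
// Each batch closure gets its own Time value, so nothing is mutated across
// batches and nothing depends on when closures are serialized to executors.
val processed = dstream2.transform { (rdd: RDD[String], time: Time) =>
  val ts = time.milliseconds          // batch time, local to this closure
  rdd.map(record => (ts, record))     // e.g. tag each record with its batch time
}
```

This keeps the timestamp flowing through the DStream API itself, which behaves the same on standalone and YARN deployments, rather than relying on a `var` that foreachRDD happens to update on the driver.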