Take a look here: http://stackoverflow.com/questions/33424445/is-there-a-way-to-checkpoint-apache-spark-dataframes

So all you have to do to checkpoint a DataFrame is the following:

df.rdd.checkpoint
df.rdd.count // or any action

(Remember to set a checkpoint directory first with sc.setCheckpointDir, or the checkpoint call will fail.)
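A slightly fuller sketch of that pattern (assuming the Spark 1.6 Scala API, with `sc` an existing SparkContext, `sqlContext` an existing SQLContext, and `df` an existing DataFrame; the checkpoint directory path is just an example):

```scala
// Checkpointing writes the RDD's data to reliable storage and truncates
// its lineage, so the DAG stops growing across iterations.
sc.setCheckpointDir("/tmp/spark-checkpoints") // must be set before checkpointing

df.rdd.checkpoint()  // mark the DataFrame's underlying RDD for checkpointing
df.rdd.count()       // any action forces the checkpoint to materialize

// To keep working with the truncated lineage as a DataFrame,
// rebuild one from the checkpointed RDD and the original schema:
val truncated = sqlContext.createDataFrame(df.rdd, df.schema)
```

This needs a running Spark cluster (or local mode), so treat it as a sketch rather than something to paste verbatim.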
2016-05-25 8:43 GMT+07:00 naliazheli <754405...@qq.com>:

> I am using Spark 1.6 and noticed the time between jobs getting longer;
> sometimes it can be 20 minutes.
> I searched for similar questions and found a close one:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-app-gets-slower-as-it-gets-executed-more-times-td1089.html#a1146
>
> and found something useful there:
>
> One thing to worry about is long-running jobs or shells. Currently, state
> buildup of a single job in Spark is a problem, as certain state such as
> shuffle files and RDD metadata is not cleaned up until the job (or shell)
> exits. We have hacky ways to reduce this, and are working on a long-term
> solution. However, separate, consecutive jobs should be independent in
> terms of performance.
>
> On Sat, Feb 1, 2014 at 8:27 PM, 尹绪森 <[hidden email]> wrote:
> Is your Spark app an iterative one? If so, your app is creating a big DAG
> in every iteration. You should checkpoint it periodically, say, one
> checkpoint every 10 iterations.
>
> I also wrote a test program; here is the code:
>
> public static void newJob(int jobNum, SQLContext sqlContext) {
>     for (int i = 0; i < jobNum; i++) {
>         testJob(i, sqlContext);
>     }
> }
>
> public static void testJob(int jobNum, SQLContext sqlContext) {
>     String test_sql = "SELECT a.* FROM income a";
>     DataFrame test_df = sqlContext.sql(test_sql);
>     test_df.registerTempTable("income");
>     test_df.cache();
>     test_df.count();
>     test_df.show();
> }
>
> Calling newJob(100, sqlContext) reproduces my issue: each job takes longer
> and longer to build.
> DataFrame has no API like checkpoint to cut the lineage.
> Is there another way to resolve this?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/job-build-cost-more-and-more-time-tp27017.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
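One way to apply the checkpoint advice to the quoted loop (a sketch, assuming Spark 1.6 in Scala rather than the original Java; the checkpoint directory and the every-10-iterations interval are illustrative choices, not from the thread):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Periodically checkpoint so the lineage behind the re-registered "income"
// table is cut instead of growing by one layer per iteration.
def runJobs(jobNum: Int, sc: SparkContext, sqlContext: SQLContext): Unit = {
  sc.setCheckpointDir("/tmp/spark-checkpoints") // required before checkpointing

  for (i <- 0 until jobNum) {
    val df = sqlContext.sql("SELECT a.* FROM income a")
    if (i % 10 == 0) {
      df.rdd.checkpoint() // mark the underlying RDD for checkpointing
      df.rdd.count()      // an action materializes the checkpoint
    }
    // Rebuild a DataFrame from the (possibly checkpointed) RDD so the
    // registered table no longer carries the accumulated DAG.
    val next = sqlContext.createDataFrame(df.rdd, df.schema)
    next.registerTempTable("income")
    next.cache()
    next.count()
  }
}
```

Unpersisting the previous iteration's cached DataFrame (`df.unpersist()`) may also help, since caching a new DataFrame each iteration otherwise leaves old blocks around until eviction.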