Thank you for this suggestion! But may I ask what the advantage of using checkpoint instead of cache is here? As far as I understand, they both cut the lineage. I only know that checkpoint saves the RDD to disk, while cache keeps it in memory. So maybe it is for reliability?
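For reference, here is a minimal sketch of the checkpoint-plus-unpersist pattern being discussed. It assumes a SparkContext `sc` with a checkpoint directory already set and a pair DStream `dstream` whose key type matches `MyRDD`; the merge rule inside the join is hypothetical, just to make the snippet complete:

```scala
import org.apache.spark.rdd.RDD

// Sketch only: assumes sc.setCheckpointDir(...) has been called and
// `dstream` is a DStream[(String, Int)] from a StreamingContext.
var myRDD: RDD[(String, Int)] = sc.emptyRDD

dstream.foreachRDD { rdd =>
  val old = myRDD
  myRDD = myRDD.fullOuterJoin(rdd).mapValues {
    case (l, r) => r.orElse(l).getOrElse(0) // hypothetical merge rule
  }.cache()

  // Truncate the lineage: checkpoint() writes the RDD to reliable storage
  // and drops its parent references, whereas cache() keeps the data in
  // memory but retains the full lineage graph.
  myRDD.checkpoint()
  myRDD.count() // force an action so the checkpoint actually happens now

  old.unpersist() // release the previous generation
}
```

Note that `checkpoint()` only takes effect after an action runs on the RDD, hence the `count()`; without it the lineage would keep growing across batches.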
Also, on http://spark.apache.org/docs/latest/streaming-programming-guide.html, I have not seen "foreachRDD" used the way I use it. Here I am not pushing data to an external system; I just use it to update an RDD within Spark. Is this right?

2015-05-08 14:03 GMT+08:00 Shao, Saisai &lt;saisai.s...@intel.com&gt;:

> I think you could use checkpoint to cut the lineage of `MyRDD`. I have a
> similar scenario and I use checkpoint to work around this problem :)
>
> Thanks
> Jerry
>
> -----Original Message-----
> From: yaochunnan [mailto:yaochun...@gmail.com]
> Sent: Friday, May 8, 2015 1:57 PM
> To: user@spark.apache.org
> Subject: Possible long lineage issue when using DStream to update a normal
> RDD
>
> Hi all,
> Recently in our project, we need to update an RDD using data regularly
> received from a DStream. I plan to use the "foreachRDD" API to achieve this:
>
> var MyRDD = ...
> dstream.foreachRDD { rdd =>
>   MyRDD = MyRDD.join(rdd).......
>   ...
> }
>
> Is this usage correct? My concern is that, as I am repeatedly and endlessly
> reassigning MyRDD in order to update it, it will create too long an RDD
> lineage to process when I want to query MyRDD later on (similar to
> https://issues.apache.org/jira/browse/SPARK-4672).
>
> Maybe I should:
> 1. Cache or checkpoint the latest MyRDD and unpersist the old MyRDD every
> time a DStream batch comes in.
> 2. Use the unpublished IndexedRDD
> (https://github.com/amplab/spark-indexedrdd) to perform efficient RDD
> updates.
>
> As I lack experience with Spark Streaming and IndexedRDD, I am here to
> make sure my thoughts are on the right track. Your suggestions will be
> greatly appreciated.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Possible-long-lineage-issue-when-using-DStream-to-update-a-normal-RDD-tp22812.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.