Thank you for this suggestion! But may I ask what the advantage of using
checkpoint instead of cache is here? As far as I understand, they both cut the
lineage. The only difference I know of is that checkpoint saves the RDD to
disk, while cache keeps it in memory. So maybe it's for reliability?
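
To check that I am comparing the right things, here is a sketch of my current
understanding (not my real code; I am assuming an existing SparkContext "sc"
with a checkpoint directory already set via sc.setCheckpointDir, and "myRDD"
is just a placeholder name):

  // cache: keep the data in memory; the full lineage is retained, so lost
  // partitions are recomputed from the parent RDDs if an executor dies.
  myRDD.cache()

  // checkpoint: write the data to reliable storage (the checkpoint directory)
  // and drop the parent lineage once it materializes, so later recovery reads
  // the saved files instead of replaying a long chain of transformations.
  myRDD.checkpoint()   // must be marked before the first action on this RDD
  myRDD.count()        // an action is needed to actually perform the checkpoint

Is this the right mental model?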

Also, on http://spark.apache.org/docs/latest/streaming-programming-guide.html
I have not seen a usage of "foreachRDD" like mine. Here I am not pushing data
to an external system; I just use it to update an RDD inside Spark. Is this
right?
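
To make my usage concrete, here is roughly what I have in mind with your
checkpoint suggestion folded in. This is only a sketch: the key/value types
and the union/reduceByKey update are placeholders for my real logic, "dstream"
is a DStream with a matching pair type, "sc" is the SparkContext (checkpoint
directory already set), and checkpointing every 10 batches is an arbitrary
interval:

  import org.apache.spark.rdd.RDD

  var myRDD: RDD[(String, Long)] = sc.emptyRDD[(String, Long)]
  var batchCount = 0L

  dstream.foreachRDD { rdd: RDD[(String, Long)] =>
    // Placeholder update step; the real job uses a join here.
    val updated = myRDD.union(rdd).reduceByKey(_ + _)
    updated.cache()                  // keep the latest version in memory

    batchCount += 1
    if (batchCount % 10 == 0) {
      // Mark for checkpointing before the first action on `updated`, so the
      // lineage is cut when it materializes below.
      updated.checkpoint()
    }

    updated.count()                  // materialize (and checkpoint, if marked)
    myRDD.unpersist()                // release the previous cached copy
    myRDD = updated
  }

I only checkpoint periodically rather than on every batch to limit the extra
disk writes; I am not sure whether that interval matters much in practice.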



2015-05-08 14:03 GMT+08:00 Shao, Saisai <saisai.s...@intel.com>:

> I think you could use checkpoint to cut the lineage of `MyRDD`. I have a
> similar scenario and I use checkpoint to work around this problem :)
>
> Thanks
> Jerry
>
> -----Original Message-----
> From: yaochunnan [mailto:yaochun...@gmail.com]
> Sent: Friday, May 8, 2015 1:57 PM
> To: user@spark.apache.org
> Subject: Possible long lineage issue when using DStream to update a normal
> RDD
>
> Hi all,
> Recently in our project, we need to update an RDD using data regularly
> received from a DStream. I plan to use the "foreachRDD" API to achieve this:
> var MyRDD = ...
> dstream.foreachRDD { rdd =>
>   MyRDD = MyRDD.join(rdd).......
>   ...
> }
>
> Is this usage correct? My concern is: as I am repeatedly and endlessly
> reassigning MyRDD in order to update it, will this create an RDD lineage that
> is too long to process when I want to query MyRDD later on (similar to
> https://issues.apache.org/jira/browse/SPARK-4672)?
>
> Maybe I should:
> 1. cache or checkpoint the latest MyRDD and unpersist the old MyRDD every
> time a new batch comes in, or
> 2. use the unpublished IndexedRDD
> (https://github.com/amplab/spark-indexedrdd) to perform efficient RDD
> updates.
>
> As I lack experience with Spark Streaming and IndexedRDD, I am here to make
> sure my thoughts are on the right track. Your suggestions will be greatly
> appreciated.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Possible-long-lineage-issue-when-using-DStream-to-update-a-normal-RDD-tp22812.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
