RE: Possible long lineage issue when using DStream to update a normal RDD

Shao, Saisai Fri, 08 May 2015 00:08:33 -0700

IIUC only checkpoint truncates the lineage information; cache does not cut the 
lineage. Also, checkpoint writes the data to reliable storage such as HDFS, not 
to local disk :)
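
To make the difference concrete, here is a minimal sketch that compares the 
dependency chain of a cached RDD with that of a checkpointed one. It assumes a 
local-mode SparkContext and uses a temporary local directory in place of HDFS; 
the `LineageDemo` object and its `depth` and `compare` helpers are 
hypothetical names, not part of Spark.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageDemo {
  /** Length of the longest dependency chain behind `rdd` (hypothetical helper). */
  def depth(rdd: RDD[_]): Int =
    if (rdd.dependencies.isEmpty) 1
    else 1 + rdd.dependencies.map(d => depth(d.rdd)).max

  /** Returns (depth after cache, depth after checkpoint) for a chain of `steps` maps. */
  def compare(sc: SparkContext, steps: Int): (Int, Int) = {
    def chain(): RDD[Int] =
      (1 to steps).foldLeft(sc.parallelize(1 to 100): RDD[Int])((r, _) => r.map(_ + 1))

    // cache() stores the computed partitions but keeps every parent RDD.
    val cached = chain().cache()
    cached.count()

    // checkpoint() must be called before the first action on the RDD; once the
    // action finishes, the parents are replaced by the checkpointed data.
    val checkpointed = chain()
    checkpointed.checkpoint()
    checkpointed.count()

    (depth(cached), depth(checkpointed))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lineage-demo"))
    // Assumption: a local temp directory stands in for an HDFS path here.
    sc.setCheckpointDir(Files.createTempDirectory("ckpt").toString)
    try println(compare(sc, steps = 50)) finally sc.stop()
  }
}
```

The first number grows with `steps`, while the second stays small no matter 
how many transformations were chained before the checkpoint.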

I think you can use foreachRDD to do this kind of RDD update work; as far as I 
can tell from your code snippet, it is OK.

From: Chunnan Yao [mailto:yaochun...@gmail.com]
Sent: Friday, May 8, 2015 2:51 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: Possible long lineage issue when using DStream to update a normal 
RDD

Thank you for this suggestion! But may I ask what the advantage of using 
checkpoint instead of cache is here? I thought they both cut the lineage. I only 
know that checkpoint saves the RDD to disk, while cache keeps it in memory. So 
maybe it's for reliability?

Also, on http://spark.apache.org/docs/latest/streaming-programming-guide.html I 
have not seen "foreachRDD" used like mine. Here I am not pushing data to an 
external system. I just use it to update an RDD in Spark. Is this right?

2015-05-08 14:03 GMT+08:00 Shao, Saisai <saisai.s...@intel.com>:
I think you could use checkpoint to cut the lineage of `MyRDD`. I have a 
similar scenario, and I use checkpoint to work around this problem :)

Thanks
Jerry

-----Original Message-----
From: yaochunnan [mailto:yaochun...@gmail.com]
Sent: Friday, May 8, 2015 1:57 PM
To: user@spark.apache.org
Subject: Possible long lineage issue when using DStream to update a normal RDD

Hi all,
Recently in our project, we need to update an RDD using data regularly received 
from a DStream. I plan to use the "foreachRDD" API to achieve this:
var MyRDD = ...
dstream.foreachRDD { rdd =>
  MyRDD = MyRDD.join(rdd).......
  ...
}

Is this usage correct? My concern is that, since I am repeatedly and endlessly 
reassigning MyRDD in order to update it, it will build up an RDD lineage that is 
too long to process when I want to query MyRDD later on (similar to
https://issues.apache.org/jira/browse/SPARK-4672)?

Maybe I should:
1. cache or checkpoint the latest MyRDD and unpersist the old MyRDD every time a 
new batch arrives from the dstream.
2. use the unpublished IndexedRDD
(https://github.com/amplab/spark-indexedrdd) to perform efficient RDD updates.
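
Option 1 could look roughly like the sketch below. It is only an illustration 
under several assumptions: the state is keyed as (String, Long), each batch has 
at most one record per key, the merge rule simply adds the batch value to the 
stored value (the real update logic is elided in the snippet above), and 
`sc.setCheckpointDir(...)` has already been called. The `RunningState` class 
and its `checkpointEvery` parameter are hypothetical names.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical holder for an RDD that is updated once per streaming batch.
class RunningState(initial: RDD[(String, Long)], checkpointEvery: Int) {
  private var state: RDD[(String, Long)] = initial.cache()
  private var batches = 0L

  /** The most recent version of the state, for later queries. */
  def current: RDD[(String, Long)] = state

  /** Runs on the driver, e.g. from inside dstream.foreachRDD. */
  def update(batch: RDD[(String, Long)]): Unit = {
    val previous = state
    // Assumed merge rule: add the batch value to the stored value.
    // Keys that appear only in the batch are dropped by leftOuterJoin.
    val next = previous
      .leftOuterJoin(batch)
      .mapValues { case (old, delta) => old + delta.getOrElse(0L) }
      .cache()

    batches += 1
    // Checkpoint periodically so the lineage cannot grow without bound.
    // This throws if no checkpoint directory was set on the SparkContext.
    if (batches % checkpointEvery == 0) next.checkpoint()

    next.count()                          // materialize the cache (and checkpoint)
    previous.unpersist(blocking = false)  // release the old version
    state = next
  }
}
```

With such a holder created once on the driver (here called `holder`), the loop 
above would become `dstream.foreachRDD { rdd => holder.update(rdd) }`. 
Checkpointing only every few batches is a trade-off between the cost of writing 
to the checkpoint directory and the length of the lineage.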

As I lack experience using Spark Streaming and IndexedRDD, I am here to make 
sure my thoughts are on the right track. Your wise suggestions will be greatly 
appreciated.
