It depends on how big the Batch RDD that needs reloading is.

Reloading it for EVERY single DStream RDD would slow down the stream processing in line with the total time required to reload the Batch RDD. But if the Batch RDD is not that big, this might not be an issue, especially in the context of the latency requirements of your streaming app.

Another, more efficient and real-time approach may be to represent your Batch RDD as DStream RDDs, keep aggregating them over the lifetime of the Spark Streaming app instance, and keep joining them with the actual DStream RDDs. You can feed your HDFS file into a Message Broker topic and consume it from there in the form of DStream RDDs, which you keep aggregating over the lifetime of the Spark Streaming app instance. A rough sketch of this approach follows below.
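Here is a minimal sketch of that aggregate-and-join shape, assuming the reference data arrives as simple "key,value" text lines. The socket sources, ports, checkpoint path and last-value-wins state function are illustrative stand-ins for whatever broker connector and merge logic you actually use, not a definitive implementation:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingReferenceJoin {
  def main(args: Array[String]): Unit = {
    // local[4]: enough threads for the two socket receivers plus processing
    val conf = new SparkConf().setAppName("StreamingReferenceJoin").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(20))
    ssc.checkpoint("/tmp/streaming-checkpoint") // updateStateByKey requires checkpointing

    // Live event stream as (key, event) pairs. socketTextStream is only a
    // stand-in for whatever real source the job consumes.
    val events = ssc.socketTextStream("localhost", 9999)
      .map(line => (line.split(",")(0), line))

    // Reference data published to a broker topic whenever the HDFS folder
    // gains a new file; modelled here as "key,value" lines on another socket.
    val referenceUpdates = ssc.socketTextStream("localhost", 9998)
      .map { line =>
        val parts = line.split(",", 2)
        (parts(0), parts(1))
      }

    // Keep aggregating the reference data over the lifetime of the app:
    // remember the most recent value seen for each key.
    val referenceState = referenceUpdates.updateStateByKey[String] {
      (newValues: Seq[String], current: Option[String]) =>
        newValues.lastOption.orElse(current)
    }

    // Join every micro-batch of events against the accumulated reference data.
    val joined = events.join(referenceState)
    joined.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

With this shape the per-batch join cost no longer depends on re-reading the whole HDFS folder; the trade-off is that the reference data has to be published to the stream, and checkpointing has to be enabled for updateStateByKey.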
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Wednesday, June 10, 2015 8:36 AM
To: Ilove Data
Cc: user@spark.apache.org
Subject: Re: Join between DStream and Periodically-Changing-RDD

RDDs are immutable, so why not join two DStreams?

Not sure, but you can also try something like this:

kvDstream.foreachRDD(rdd => {
  // Re-read the reference file from HDFS on every micro-batch and join against it
  val file = ssc.sparkContext.textFile("/sigmoid/")
  val kvFile = file.map(x => (x.split(",")(0), x))
  rdd.join(kvFile)
})

Thanks
Best Regards

On Tue, Jun 9, 2015 at 7:37 PM, Ilove Data <data4...@gmail.com> wrote:

Hi,

I'm trying to join a DStream with a batch interval of, say, 20s with an RDD loaded from an HDFS folder that changes periodically, say with a new file arriving in the folder every 10 minutes.

How should this be done, considering that the HDFS files in the folder keep changing and new files keep being added? Does an RDD automatically detect changes in the HDFS folder it was loaded from and reload itself?

Thanks!
Rendy