Hi experts, I'm using Apache Spark Streaming 1.6.1 to write a Java application that joins two Key/Value data streams and writes the output to HDFS. The two data streams contain K/V strings and are periodically ingested in Spark from HDFS by using textFileStream(). The two data streams aren't synchronized, which means that some keys that are in stream1 at time t0 may appear in stream2 at time t1, or the vice versa. Hence, my goal is to join the two streams and compute "leftover" keys, which should be considered for the join operation in the next batch intervals. To better clarify this, look at the following algorithm:
variables: stream1 = <String, String> input stream at time t1 stream2 = <String, String> input stream at time t1 left_keys_s1 = <String, String> records of stream1 that didn't appear in the join at time t0 left_keys_s2 = <String, String> records of stream2 that didn't appear in the join at time t0 operations at time t1: out_stream = (stream1 + left_keys_s1) join (stream2 + left_keys_s2) write out_stream to HDFS left_keys_s1 = left_keys_s1 + records of stream1 not in out_stream (should be used at time t2) left_keys_s2 = left_keys_s2 + records of stream2 not in out_stream (should be used at time t2) I've tried to implement this algorithm with Spark Streaming unsuccessfully. Initially, I create two empty streams for leftover keys in this way (this is only one stream, but the code to generate the second stream is similar): JavaRDD<String> empty_rdd = sc.emptyRDD(); //sc = Java Spark Context Queue<JavaRDD<String>> q = new LinkedList<JavaRDD<String>>(); q.add(empty_rdd); JavaDStream<String> empty_dstream = jssc.queueStream(q); JavaPairDStream<String, String> k1 = empty_dstream.mapToPair(new PairFunction<String, String, String> () { @Override public scala.Tuple2<String, String> call(String s) { return new scala.Tuple2(s, s); } }); Later on, this empty stream is unified (i.e., union()) with stream1 and finally, after the join, I add the leftover keys from stream1 and call window(). The same happens with stream2. The problem is that the operations that generate left_keys_s1 and left_keys_s2 are transformations without actions, which means that Spark doesn't create any RDD flow graph and, hence, they are never executed. What I get right now is a join that outputs only the records whose keys are in stream1 and stream2 in the same time interval. Do you guys have any suggestion to implement this correctly with Spark? Thanks, Marco -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-carry-data-streams-over-multiple-batch-intervals-in-Spark-Streaming-tp26994.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org