Hi experts,

I'm using Apache Spark Streaming 1.6.1 to write a Java application that joins two key/value data streams and writes the output to HDFS. The two data streams contain K/V strings and are periodically ingested into Spark from HDFS using textFileStream().

The two data streams aren't synchronized, which means that some keys present in stream1 at time t0 may appear in stream2 at time t1, or vice versa. Hence, my goal is to join the two streams and compute the "leftover" keys, which should be considered for the join operation in the next batch intervals. To clarify, consider the following algorithm:

variables:
stream1 = <String, String> input stream at time t1
stream2 = <String, String> input stream at time t1
left_keys_s1 = <String, String> records of stream1 that didn't appear in the 
join at time t0
left_keys_s2 = <String, String> records of stream2 that didn't appear in the 
join at time t0

operations at time t1:
out_stream = (stream1 + left_keys_s1) join (stream2 + left_keys_s2)
write out_stream to HDFS
left_keys_s1 = left_keys_s1 + records of stream1 not in out_stream (should be 
used at time t2)
left_keys_s2 = left_keys_s2 + records of stream2 not in out_stream (should be 
used at time t2)
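
In plain (non-streaming) Spark terms, the per-batch logic would look roughly like the sketch below. This is only illustrative: it assumes the two batches and the two leftover sets are already available as JavaPairRDDs, and all names (stream1Batch, outputPath, etc.) are made up.

// stream1Batch, stream2Batch: the <String, String> batches at time t1
// leftKeysS1, leftKeysS2: records left over from time t0
JavaPairRDD<String, String> s1 = stream1Batch.union(leftKeysS1);
JavaPairRDD<String, String> s2 = stream2Batch.union(leftKeysS2);

// out_stream = (stream1 + left_keys_s1) join (stream2 + left_keys_s2)
JavaPairRDD<String, scala.Tuple2<String, String>> outStream = s1.join(s2);
outStream.saveAsTextFile(outputPath); // write out_stream to HDFS (path is illustrative)

// leftovers: records whose keys did not appear in the join output,
// to be used at time t2
JavaPairRDD<String, String> nextLeftKeysS1 = s1.subtractByKey(outStream);
JavaPairRDD<String, String> nextLeftKeysS2 = s2.subtractByKey(outStream);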
I've tried to implement this algorithm with Spark Streaming, so far unsuccessfully. Initially, I create two empty streams for the leftover keys in this way (this is only one stream, but the code to generate the second one is similar):

JavaRDD<String> empty_rdd = sc.emptyRDD(); // sc = JavaSparkContext
Queue<JavaRDD<String>> q = new LinkedList<JavaRDD<String>>();
q.add(empty_rdd);
JavaDStream<String> empty_dstream = jssc.queueStream(q); // jssc = JavaStreamingContext

// turn the empty DStream into a pair DStream so it can later be unioned/joined
JavaPairDStream<String, String> k1 = empty_dstream.mapToPair(
    new PairFunction<String, String, String>() {
      @Override
      public scala.Tuple2<String, String> call(String s) {
        return new scala.Tuple2<String, String>(s, s);
      }
    });
Later on, this empty stream is unified (i.e., union()) with stream1 and, finally, after the join, I add the leftover keys from stream1 and call window(). The same happens with stream2.
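
In simplified form, the current wiring looks roughly like this (an illustrative sketch only; the variable names are made up and the symmetric code for stream2 is omitted):

// stream1, stream2: JavaPairDStream<String, String> built from textFileStream()
// k1, k2: the initially-empty leftover DStreams created as shown above
JavaPairDStream<String, String> s1 = stream1.union(k1);
JavaPairDStream<String, String> s2 = stream2.union(k2);
JavaPairDStream<String, scala.Tuple2<String, String>> joined = s1.join(s2);
// joined is written to HDFS; left_keys_s1 and left_keys_s2 are then derived
// from s1, s2 and joined (transformations only, no action) and window() is
// called on them so they can be reused in the next batch interval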
The problem is that the operations that generate left_keys_s1 and left_keys_s2 are transformations without actions, which means that Spark doesn't create any RDD flow graph for them and, hence, they are never executed. What I get right now is a join that outputs only the records whose keys appear in stream1 and stream2 within the same batch interval.

Do you have any suggestions on how to implement this correctly with Spark?

Thanks,
Marco
