Hi experts,

I'm using Apache Spark Streaming 1.6.1 to write a Java application that joins two key/value data streams and writes the output to HDFS. The two data streams contain K/V strings and are periodically ingested into Spark from HDFS using textFileStream().
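For completeness, each stream is built more or less like this (a sketch, not my exact code: the directory path and the tab separator are placeholders for my actual format):

import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Ingest one directory and split each line into a (key, value) pair.
JavaDStream<String> lines1 = jssc.textFileStream("hdfs:///input/stream1");
JavaPairDStream<String, String> stream1 = lines1.mapToPair(
    new PairFunction<String, String, String>() {
        @Override
        public scala.Tuple2<String, String> call(String line) {
            String[] kv = line.split("\t", 2); // key<TAB>value (placeholder format)
            return new scala.Tuple2<String, String>(kv[0], kv[1]);
        }
    });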
The two data streams aren't synchronized, which means that some keys that are in stream1 at time t0 may appear in stream2 at time t1, or vice versa. Hence, my goal is to join the two streams and compute the "leftover" keys, which should be considered for the join operation in the next batch intervals. To better clarify this, look at the following algorithm:

variables:
stream1 = input stream at time t1
stream2 = input stream at time t1
left_keys_s1 = records of stream1 that didn't appear in the join at time t0
left_keys_s2 = records of stream2 that didn't appear in the join at time t0

operations at time t1:
out_stream = (stream1 + left_keys_s1) join (stream2 + left_keys_s2)
write out_stream to HDFS
left_keys_s1 = left_keys_s1 + records of stream1 not in out_stream (should be used at time t2)
left_keys_s2 = left_keys_s2 + records of stream2 not in out_stream (should be used at time t2)
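In plain (batch) RDD terms, one step of this algorithm would look roughly like the sketch below; s1, s2, left1 and left2 are hypothetical JavaPairRDD<String, String> variables holding the current batch and the carried-over records, and the output path is a placeholder:

import org.apache.spark.api.java.JavaPairRDD;

// One batch of the algorithm at time t1, expressed on plain RDDs.
// s1/s2: records of stream1/stream2 for this batch (placeholder names);
// left1/left2: leftovers carried over from time t0.
JavaPairRDD<String, String> in1 = s1.union(left1);
JavaPairRDD<String, String> in2 = s2.union(left2);

// out_stream = (stream1 + left_keys_s1) join (stream2 + left_keys_s2)
JavaPairRDD<String, scala.Tuple2<String, String>> out = in1.join(in2);
out.saveAsTextFile("hdfs:///output/t1"); // placeholder path

// Records whose key found no match yet are kept for time t2.
left1 = in1.subtractByKey(out);
left2 = in2.subtractByKey(out);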
I've tried to implement this algorithm with Spark Streaming, unsuccessfully. Initially, I create two empty streams for the leftover keys in this way (this is only one stream; the code to generate the second stream is similar):

JavaRDD<String> empty_rdd = sc.emptyRDD(); // sc = Java Spark Context
Queue<JavaRDD<String>> q = new LinkedList<JavaRDD<String>>();
q.add(empty_rdd);
JavaDStream<String> empty_dstream = jssc.queueStream(q);
JavaPairDStream<String, String> k1 = empty_dstream.mapToPair(
    new PairFunction<String, String, String>() {
        @Override
        public scala.Tuple2<String, String> call(String s) {
            return new scala.Tuple2<String, String>(s, s);
        }
    });
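(As far as I understand, queueStream() with its default one-at-a-time behaviour dequeues this single empty RDD in the first batch only, which should be fine here since the stream merely seeds the union below.)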
Later on, this empty stream is unified (i.e., union()) with stream1 and finally, after the join, I add the leftover keys from stream1 and call window(). The same happens with stream2, roughly as sketched below.
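To make the shape of this concrete, here is roughly what that part looks like (again a sketch, not my exact code: k2 is the second empty pair stream, and WINDOW and SLIDE are placeholder durations):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.Time;

Duration WINDOW = Durations.seconds(60); // placeholder window length
Duration SLIDE = Durations.seconds(60);  // placeholder slide interval

// Union each real input with its (initially empty) leftover stream.
JavaPairDStream<String, String> s1_plus = stream1.union(k1);
JavaPairDStream<String, String> s2_plus = stream2.union(k2);

// Join the two unified streams; this is what gets written to HDFS.
JavaPairDStream<String, scala.Tuple2<String, String>> joined = s1_plus.join(s2_plus);

// Leftovers: records of s1_plus whose key was not matched in this batch.
JavaPairDStream<String, String> left_keys_s1 = s1_plus.transformWithToPair(joined,
    new Function3<JavaPairRDD<String, String>,
                  JavaPairRDD<String, scala.Tuple2<String, String>>,
                  Time,
                  JavaPairRDD<String, String>>() {
        @Override
        public JavaPairRDD<String, String> call(
                JavaPairRDD<String, String> all,
                JavaPairRDD<String, scala.Tuple2<String, String>> out,
                Time t) {
            return all.subtractByKey(out);
        }
    }).window(WINDOW, SLIDE);
// left_keys_s2 is built symmetrically from s2_plus.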
The problem is that the operations that generate left_keys_s1 and left_keys_s2 are transformations without actions, which means that Spark doesn't create any RDD flow graph for them and, hence, they are never executed. What I get right now is a join that outputs only the records whose keys are in stream1 and stream2 within the same batch interval.

Do you have any suggestions on how to implement this correctly with Spark?

Thanks,
Marco