How to carry data streams over multiple batch intervals in Spark Streaming

Marco1982 Sat, 21 May 2016 11:09:06 -0700

Hi experts,
I'm using Apache Spark Streaming 1.6.1 to write a Java application that
joins two Key/Value data streams and writes the output to HDFS. The two data
streams contain K/V strings and are periodically ingested into Spark from
HDFS by using textFileStream().
The two data streams aren't synchronized, which means that some keys that
are in stream1 at time t0 may appear in stream2 at time t1, or vice versa.
Hence, my goal is to join the two streams and compute "leftover" keys, which
should be carried over and considered for the join in the following batch
intervals.
To clarify, consider the following algorithm:


variables:
stream1 = <String, String> input stream at time t1
stream2 = <String, String> input stream at time t1
left_keys_s1 = <String, String> records of stream1 that didn't appear in the
join at time t0
left_keys_s2 = <String, String> records of stream2 that didn't appear in the
join at time t0

operations at time t1:
out_stream = (stream1 + left_keys_s1) join (stream2 + left_keys_s2)
write out_stream to HDFS
left_keys_s1 = left_keys_s1 + records of stream1 not in out_stream (should
be used at time t2)
left_keys_s2 = left_keys_s2 + records of stream2 not in out_stream (should
be used at time t2)
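
To make the intended semantics concrete, here is a minimal plain-Java sketch
of the algorithm above, with no Spark involved. It assumes at most one value
per key, and it drops a record from the leftovers once it has been joined.
The class and method names (LeftoverJoinSketch, processBatch, pendingCount)
are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a Spark API: simulates the carry-over join on plain maps.
class LeftoverJoinSketch {
    // Records from earlier batches that have not found a join partner yet.
    // TreeMap keeps keys sorted so the output order is deterministic.
    private final Map<String, String> leftKeysS1 = new TreeMap<>();
    private final Map<String, String> leftKeysS2 = new TreeMap<>();

    // Processes one batch interval; returns the joined records as "key,value1,value2".
    List<String> processBatch(Map<String, String> stream1, Map<String, String> stream2) {
        // (stream + left_keys): merge the new records with the leftovers.
        leftKeysS1.putAll(stream1);
        leftKeysS2.putAll(stream2);

        // Inner join on the key.
        List<String> outStream = new ArrayList<>();
        List<String> joinedKeys = new ArrayList<>();
        for (Map.Entry<String, String> e : leftKeysS1.entrySet()) {
            String v2 = leftKeysS2.get(e.getKey());
            if (v2 != null) {
                outStream.add(e.getKey() + "," + e.getValue() + "," + v2);
                joinedKeys.add(e.getKey());
            }
        }

        // Whatever was not joined stays in the maps for the next interval.
        for (String key : joinedKeys) {
            leftKeysS1.remove(key);
            leftKeysS2.remove(key);
        }
        return outStream;
    }

    // Number of records still waiting for a partner.
    int pendingCount() {
        return leftKeysS1.size() + leftKeysS2.size();
    }

    public static void main(String[] args) {
        LeftoverJoinSketch join = new LeftoverJoinSketch();
        // t0: "a" is only in stream1 and "b" only in stream2, so nothing joins.
        System.out.println(join.processBatch(
            Collections.singletonMap("a", "1"), Collections.singletonMap("b", "2")));
        // t1: the partners arrive in the opposite streams, so both keys join.
        System.out.println(join.processBatch(
            Collections.singletonMap("b", "3"), Collections.singletonMap("a", "4")));
    }
}
```

In this sketch the leftover maps are ordinary mutable state that lives
across calls; that cross-batch state is what I'm trying to reproduce in
Spark Streaming.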

I've tried, unsuccessfully, to implement this algorithm with Spark Streaming.
Initially, I create two empty streams for the leftover keys in this way (only
one stream is shown; the code that generates the second one is similar):

// sc is the JavaSparkContext, jssc the JavaStreamingContext
JavaRDD<String> empty_rdd = sc.emptyRDD();
Queue<JavaRDD<String>> q = new LinkedList<JavaRDD<String>>();
q.add(empty_rdd);
JavaDStream<String> empty_dstream = jssc.queueStream(q);
// Turn each string s into the pair (s, s)
JavaPairDStream<String, String> k1 = empty_dstream.mapToPair(
    new PairFunction<String, String, String>() {
      @Override
      public scala.Tuple2<String, String> call(String s) {
        return new scala.Tuple2<String, String>(s, s);
      }
    });

Later on, this empty stream is merged with stream1 via union() and, after
the join, I add the leftover keys from stream1 and call window(). The same
happens with stream2.
The problem is that the operations that generate left_keys_s1 and
left_keys_s2 are transformations without actions, which means that Spark
doesn't create any RDD flow graph for them and, hence, they are never
executed. What I get right now is a join that outputs only the records whose
keys are in both stream1 and stream2 in the same batch interval.
Do you have any suggestions for implementing this correctly with Spark?

Thanks, 
Marco



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-carry-data-streams-over-multiple-batch-intervals-in-Spark-Streaming-tp26994.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
