Re: Join two Spark Streaming

2016-06-07 Thread vinay453

Re: Join two Spark Streaming

2014-07-11 Thread Bill Jay
Hi Tathagata, Thanks for the solution. Actually, I will use the number of unique integers in each batch rather than the cumulative number of unique integers. I do have two questions about your code: 1. Why do we need uniqueValuesRDD, and why do we need to call uniqueValuesRDD.checkpoint()? 2. Where

Re: Join two Spark Streaming

2014-07-11 Thread Tathagata Das
1. Since the RDD of the previous batch is used to create the RDD of the next batch, the lineage of dependencies in the RDDs grows without bound. That's not good, because it increases fault-recovery times, task sizes, etc. Checkpointing saves the data of an RDD to HDFS and truncates the
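A toy model of this point, in plain Scala (no Spark; the 25 batches and the checkpoint interval of 10 are made-up numbers): each batch's RDD depends on the previous batch's, so the dependency chain grows by one per batch, and a periodic checkpoint resets it.

```scala
// Each new batch adds one link to the lineage; checkpointing truncates it.
val numBatches = 25
val checkpointInterval = 10

var lineageLength = 0
val lengths = (1 to numBatches).map { _ =>
  lineageLength += 1                       // new RDD depends on the previous one
  if (lineageLength >= checkpointInterval) // checkpoint truncates the lineage
    lineageLength = 0
  lineageLength
}
// Without checkpointing the chain would reach length 25; with it, the chain
// never exceeds the checkpoint interval, so recovery cost stays bounded.
```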

Re: Join two Spark Streaming

2014-07-10 Thread Tathagata Das
Do you want to continuously maintain the set of unique integers seen since the beginning of the stream?

    var uniqueValuesRDD: RDD[Int] = ...
    dstreamOfIntegers.transform(newDataRDD => {
      val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct()
      uniqueValuesRDD = newUniqueValuesRDD //
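A runnable plain-Scala model of the pattern in this message (no Spark; the batch contents are invented for illustration), carrying the running set of unique values across batches the way uniqueValuesRDD does:

```scala
// Model the DStream as a sequence of batches of integers.
val batches = Seq(Seq(1, 2, 3), Seq(1, 2, 2), Seq(4))

// Union each new batch into the carried set, then deduplicate --
// the Set does both steps at once, like union(...).distinct() on RDDs.
var uniqueValues = Set.empty[Int]
for (batch <- batches) {
  uniqueValues = uniqueValues ++ batch
}
// After three batches, uniqueValues holds every integer seen so far.
```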

Join two Spark Streaming

2014-07-08 Thread Bill Jay
Hi all, I am working on a pipeline that needs to join two Spark streams. The input is a stream of integers, and the output is each integer's number of appearances divided by the total number of unique integers. Suppose the input is: 1 2 3 1 2 2. There are 3 unique integers, and 1 appears twice.
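The per-batch computation described here can be sketched in plain Scala (no Spark), using the example batch from the message:

```scala
// The example batch from the question.
val batch = Seq(1, 2, 3, 1, 2, 2)

// Count how many times each integer appears.
val counts: Map[Int, Int] =
  batch.groupBy(identity).map { case (k, vs) => (k, vs.size) }

// Number of unique integers in the batch.
val numUnique = counts.size // 3 for this batch

// Each integer's appearance count divided by the number of unique integers.
val ratios: Map[Int, Double] =
  counts.map { case (k, c) => (k, c.toDouble / numUnique) }
// 1 appears twice, so its ratio is 2/3.
```

In Spark Streaming this would run per batch, with the counts coming from one stream and the unique total from the other side of the join.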