Dear Spark developers,

I am working with Spark Streaming 1.6.1. The task is to get the RDD of each time window and pass it to an external analytics function. This function accepts an RDD, so I cannot hand it a DStream. I learned that DStream.window(...).compute(time) returns Option[RDD]. I am trying to use it in the following code, derived from the example in the programming guide:

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val rdd = lines.window(Seconds(5), Seconds(3)).compute(Time(System.currentTimeMillis())) // this does not seem to be the proper way to set the time
    ssc.start()
    ssc.awaitTermination()

At the line that computes rdd I get the following exception:

    Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.SocketInputDStream@2264e43c has not been initialized.

The other option for getting an RDD from a DStream is the slice function. However, it is not clear how to use it, and I get the same exception with the following call:

    val rdd = lines.slice(Time(System.currentTimeMillis() - 100), Time(System.currentTimeMillis())) // does not seem correct either

Could you suggest the proper use of the compute or slice functions on a DStream, or another way to get an RDD from a DStream?

Best regards,
Alexander

P.S. I have found the following example that runs streaming within a loop, but it looks hacky: https://github.com/chlam4/spark-exercises/blob/master/using-dstream-slice.scala
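P.P.S. In case it clarifies what I am after, here is a minimal sketch of the fallback I am considering: calling foreachRDD on the windowed DStream, which I believe sidesteps the initialization error because the callback only runs after ssc.start(). Here externalAnalytics is a hypothetical stand-in for my real analytics function.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedRddSketch {
  // Hypothetical placeholder for the external function that accepts an RDD.
  def externalAnalytics(rdd: RDD[String]): Unit = rdd.foreach(println)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD hands over the RDD of each window once the context is running,
    // so the DStream is already initialized when the RDD is materialized.
    lines.window(Seconds(5), Seconds(3)).foreachRDD { rdd =>
      externalAnalytics(rdd)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This works for me functionally, but I would still like to know whether compute or slice can be used directly, since foreachRDD keeps the processing inside the streaming job.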