bs"d

I am new to Spark Streaming and have some questions that I cannot find answered anywhere in the documentation.
I believe many Spark users in general, and Spark Streaming users in particular, use it to analyze events by computing large distributed aggregations. In my case I have to "digest" a lot of events very fast, and I compute both high-resolution aggregates (e.g. every 30 seconds) and hourly aggregates.

1. Suppose I take the DStream of RDDs generated by the 30-second aggregates and call countByValueAndWindow with windowDuration = 60 minutes and slideDuration = 60 minutes. Will each RDD added to the DStream be folded into the window as soon as it is generated, or will the whole aggregation be performed only after 60 minutes? If it is performed only after an hour, I guess it would be better to do the periodic aggregates myself using foreachRDD?

2. I understood from the documentation that DStreams are persisted by default and that we should use checkpointing to "free" some memory. If I compute the hourly aggregates and only want to store those, is it possible to free the intermediate DStreams without calling checkpoint, which writes the data to disk and may become a bottleneck?

3. Is it possible to have the "checkpoint" written not as an HDFS file but in another format, e.g. into a Cassandra database?

I find almost no documentation on the Spark Streaming project and hope someone who understands the material well can shed some light on the subject.

Best,
DD

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-API-and-Performance-Clarifications-tp12717.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
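P.S. To make question 1 concrete, the setup I am describing looks roughly like this (a minimal sketch in the Scala API; the socket source, host, and port are just placeholders for my real event source):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hourly-aggregates")
    // 30-second batch interval: each micro-batch produces one RDD of events
    val ssc = new StreamingContext(conf, Seconds(30))

    // Placeholder source; in reality the events come from elsewhere
    val events = ssc.socketTextStream("localhost", 9999)

    // Window and slide are both 60 minutes, so the windowed stream
    // should emit one counts-RDD per hour covering the last hour's batches.
    // My question is when the counting work actually happens: incrementally
    // as each 30-second batch arrives, or all at once every hour.
    val hourlyCounts = events.countByValueAndWindow(Minutes(60), Minutes(60))

    hourlyCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```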
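P.P.S. For question 3, what I would like instead of HDFS output is something like the following sketch using foreachRDD. I am assuming the DataStax spark-cassandra-connector's saveToCassandra here; the keyspace and table names are made up, and hourlyCounts stands for the windowed DStream from question 1:

```scala
// Assumed dependency: com.datastax.spark:spark-cassandra-connector
import com.datastax.spark.connector._

// Write each hourly aggregate RDD directly to Cassandra instead of HDFS.
// "my_keyspace" / "hourly_counts" are illustrative; the table's columns
// would have to match the (value, count) pairs being saved.
hourlyCounts.foreachRDD { rdd =>
  rdd.saveToCassandra("my_keyspace", "hourly_counts")
}
```

Would something along these lines also let me avoid the checkpoint-to-disk cost from question 2, or does Spark Streaming still need the checkpoint directory for the window state?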