Spark Streaming - long garbage collection time

2016-06-03 Thread Marco1982
Hi all, I'm running a Spark Streaming application with 1-hour batches to join two data feeds and write the output to disk. The total size of one data feed is about 40 GB per hour (split across multiple files), while the size of the second data feed is about 600-800 MB per hour (also split across multiple

Neither previous window has value for key, nor new values found.

2016-06-10 Thread Marco1982
Hi all, I'm running a Spark Streaming application that uses reduceByKeyAndWindow(). The window interval is 2 hours, while the slide interval is 1 hour. I have a JavaPairRDD in which both keys and values are strings. Each time the reduceByKeyAndWindow() function is called, it uses appendString()

Spark Streaming - Is window() caching DStreams?

2016-05-27 Thread Marco1982
Dear all, can someone please explain to me how Spark Streaming executes the window() operation? From the Spark 1.6.1 documentation, it seems that windowed batches are automatically cached in memory, but looking at the web UI it seems that operations already executed in previous batches are executed

Symbolic links in Spark

2016-06-01 Thread Marco1982
Hi all, It seems to me that Spark Streaming doesn't read symbolic links. Can you confirm that? Marco

How to carry data streams over multiple batch intervals in Spark Streaming

2016-05-21 Thread Marco1982
Hi experts, I'm using Apache Spark Streaming 1.6.1 to write a Java application that joins two key/value data streams and writes the output to HDFS. The two data streams contain K/V strings and are periodically ingested into Spark from HDFS by using textFileStream(). The two data streams aren't