Hello. I have a long-running streaming application. It consumes a large amount of data from Kafka (on the order of 25K messages/second) in two-minute batches. The job reads the data, makes some decisions about what to save, and writes the selected data into Cassandra. The job is very stable: each batch takes around 25 seconds to process, and every 30 minutes new training data is read from Cassandra, which increases the batch time to around 1 minute.
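For context, the shape of the job is roughly the sketch below. This is a minimal skeleton rather than the actual code: the broker, topic, keyspace, table, and column names are placeholders, and the filter is a stand-in for the real selection logic.

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Minutes, StreamingContext}

object StreamingJobSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-to-cassandra")
      .set("spark.cassandra.connection.host", "cassandra-host") // placeholder
    // Two-minute batch interval, as described above.
    val ssc = new StreamingContext(conf, Minutes(2))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder
    val topics = Set("events")                                      // placeholder

    // Direct stream from Kafka (~25K messages/second).
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Decide what to keep in each batch, then write the selection to Cassandra.
    messages
      .map { case (key, value) => (Option(key).getOrElse(""), value) }  // placeholder keying
      .filter { case (_, value) => value != null && value.nonEmpty }    // stand-in selection logic
      .saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "payload"))

    ssc.start()
    ssc.awaitTermination()
  }
}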
The issue I'm seeing is that the job is stable for around 5-7 hours, after which each batch takes an increasingly long time to compute. Executor memory used (cached RDDs) stays at about the same level, no operation uses data from more than a single batch, and the write time to Cassandra does not vary significantly - everything just suddenly takes longer to compute.

Initially I was seeing instability over a shorter time horizon. To address that I took the following steps (items 1-3 are sketched at the end of this message):

1. Explicitly released RDDs via unpersist once they were no longer required.
2. Turned on the G1 garbage collector via -XX:+UseG1GC.
3. Enabled Kryo serialization.
4. Removed several long-running aggregate operations (reduceByKeyAndWindow, updateStateByKey) from the job.

The result is a job that appears completely stable for hours at a time, until the slowdown described above sets in. The OS does not appear to be running any odd tasks, and Cassandra compaction/cleanup/repair is not responsible for the delay.

Has anyone seen similar behavior? Any thoughts?

Regards,
Bryan Jeffrey
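For reference, items 1-3 above correspond roughly to the following. This is only a sketch: the option values are illustrative (they could equally be --conf flags on spark-submit), and the input path is a placeholder standing in for the training data actually read from Cassandra.

import org.apache.spark.{SparkConf, SparkContext}

object TuningSketch {
  def main(args: Array[String]): Unit = {
    // Steps 2 and 3: the G1 collector and Kryo serialization, set via the Spark configuration.
    val conf = new SparkConf()
      .setAppName("kafka-to-cassandra")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    val sc = new SparkContext(conf)

    // Step 1: cache data only while it is needed, then release it explicitly
    // rather than waiting for eviction. (Placeholder input; the real job reads
    // training data from Cassandra every 30 minutes.)
    val trainingData = sc.textFile("/placeholder/training/path").cache()
    println(trainingData.count())   // use the cached data during the batch
    trainingData.unpersist(blocking = false)

    sc.stop()
  }
}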