Hello.

I have a long-running streaming application. It consumes a large amount
of data from Kafka (on the order of 25K messages/second, in two-minute
batches).  The job reads the data, makes some decisions on what to save,
and writes the selected data into Cassandra.  The job is very stable: each
batch takes around 25 seconds to process, and every 30 minutes new training
data is read from Cassandra, which increases batch time to around 1 minute.
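For reference, a minimal sketch of this kind of Kafka-to-Cassandra streaming
pipeline is below.  This is not the actual job code: the broker, topic,
keyspace, table, and class names are illustrative, and the real selection
logic is omitted.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.datastax.spark.connector._

    object StreamingJobSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("KafkaToCassandraSketch")
        // Two-minute batches, matching the interval described above.
        val ssc = new StreamingContext(conf, Minutes(2))

        // Direct Kafka stream of (key, value) pairs; broker and topic are placeholders.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("events"))

        stream.foreachRDD { rdd =>
          // Decide what to save (real selection logic omitted) and write the
          // selected records to Cassandra via the spark-cassandra-connector.
          val selected = rdd.map(_._2).filter(_.nonEmpty)
          selected.map(Tuple1(_)).saveToCassandra("my_keyspace", "my_table", SomeColumns("value"))
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }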

The issue I'm seeing is that the job is stable for around 5-7 hours, after
which each batch takes an increasingly long time to compute.  The executor
memory used (cached RDDs) remains at around the same level, no operation
looks at more than a single batch of data, and the write time to Cassandra
does not vary significantly.  Everything just suddenly seems to take longer
to compute.

Initially I was seeing instability over a shorter time horizon.  To address
this I took the following steps (a rough sketch of steps 1-3 follows the list):

1. Explicitly released cached RDDs via 'unpersist' once they were no longer
required.
2. Turned on the G1 garbage collector via -XX:+UseG1GC.
3. Enabled Kryo serialization.
4. Removed several long-running aggregate operations (reduceByKeyAndWindow,
updateStateByKey) from this job.
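Concretely, steps 1-3 amount to something along these lines (again a sketch,
not the code in the job; the caching helper is illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD

    object StabilitySettingsSketch {
      // Steps 2 and 3: G1 GC on the executors and Kryo serialization, set
      // through standard Spark configuration keys.
      val conf = new SparkConf()
        .setAppName("KafkaToCassandraSketch")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")

      // Step 1: cache an RDD only for the duration of the work that needs it,
      // then release it explicitly rather than waiting for it to age out.
      def withCached[T, U](rdd: RDD[T])(body: RDD[T] => U): U = {
        rdd.persist()
        try body(rdd)
        finally rdd.unpersist(blocking = false)
      }
    }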

The result is a job that appears completely stable for hours at a time.
The OS does not appear to be running any unusual background tasks, and
Cassandra compaction/cleanup/repair is not responsible for the delay.  Has
anyone seen similar behavior?  Any thoughts?

Regards,

Bryan Jeffrey
