We have a Spark Streaming application that has essentially zero scheduling
delay for hours, but then the delay suddenly jumps to multiple minutes and
spirals out of control (see this screenshot of the job manager:
http://i.stack.imgur.com/kSftN.png)

This happens after a while even if we double the batch interval.
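
For context, here is a minimal sketch (Scala) of where the batch interval is
set in a setup like ours; the app name, socket source, and 10-second interval
are placeholders, not our actual code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch")
    // The batch interval is fixed when the StreamingContext is created;
    // "doubling" it means changing this Duration, e.g. Seconds(10) -> Seconds(20).
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source and output operation so the context has work to run.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}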

We are not sure what causes the delay (one theory is garbage collection).
The cluster generally has low CPU utilization regardless of whether we use
3, 5 or 10 slaves.
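
To check the GC theory we are thinking of turning on GC logging on the
executors. A rough sketch only, assuming a HotSpot JVM with the Java 8-style
GC logging flags; it would drop in where the SparkConf is built (or the
equivalent --conf at submit time):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("StreamingSketch")
  // Write GC details and timestamps to the executor stderr logs, so long
  // pauses can be lined up against the scheduling-delay spikes in the UI.
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")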

We are really reluctant to increase the batch interval further, since the
delay is zero for such long stretches. Are there any techniques to improve
recovery time after a sudden spike in scheduling delay? We've tried letting
it recover on its own, but that takes hours, if it recovers at all.

Thanks,
-cjoseph
