Github user shengshuyang commented on the issue: https://github.com/apache/spark/pull/21929

@brucezhao11 Yup, it was strange; after all, your patch only touches the checkpoint recovery process, so it shouldn't cause any problems once new batches start processing. I did rebase your change onto the 2.3.0 release though, so I wonder if that made a difference.

Each of our consumers only had one receiver process, and if you meant "receivers across all consumers die at the same time", that's not the case either: receivers die one at a time, at random. We have monitoring set up so our scheduling system kills and restarts a consumer when this happens, but checking our restart history we can't find any obvious pattern, except that the issue disappeared immediately after rolling back to the unpatched version.

Reading some documentation on Spark internals, there should be a [receiver supervisor](https://jaceklaskowski.gitbooks.io/spark-streaming/content/spark-streaming-receiversupervisors.html) in charge of restarting the receiver process; if you run into the same issue in the future (hope not), it might be worth digging into that.
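For context, a minimal sketch (not from this PR) of the standard custom-receiver pattern from the Spark Streaming programming guide, showing the part the ReceiverSupervisor handles: when the receiver calls `restart(...)`, the supervisor asynchronously stops the receiver and relaunches it. The `CustomReceiver` class name and the `host`/`port` parameters are placeholders for illustration only.

```scala
import java.net.Socket

import scala.io.Source

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that reads lines from a socket; host/port are placeholders.
class CustomReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Called by the framework; must return quickly, so receive on a separate thread.
  def onStart(): Unit = {
    new Thread("Custom Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to do here: the receiving thread checks isStopped() and exits on its own.
  def onStop(): Unit = {}

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val lines = Source.fromInputStream(socket.getInputStream, "UTF-8").getLines()
      while (!isStopped && lines.hasNext) {
        store(lines.next()) // hand data to Spark for block generation
      }
      socket.close()
      // restart() is handled by the ReceiverSupervisor, which stops and relaunches the receiver.
      restart("Connection closed, trying to reconnect")
    } catch {
      case e: java.net.ConnectException =>
        restart(s"Could not connect to $host:$port", e)
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }
}
```

So if the symptom is receivers dying one at a time without being brought back, the supervisor's restart path (and whatever the restart delay/config is on your cluster) would be the place to look.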