Github user shengshuyang commented on the issue:

    https://github.com/apache/spark/pull/21929
  
    @brucezhao11 Yup, it was strange; after all, your patch only touches the checkpoint recovery process, so it shouldn't cause any problems once the job starts processing new batches. I did rebase your change onto the 2.3.0 release though, so I wonder if that made a difference.
    
    Each of our consumers only had one receiver process, and if you meant "receivers across all consumers die at the same time", that's not the case either: receivers die one at a time, at random. We have monitoring set up so our scheduling system kills and restarts a consumer when that happens, but checking our restart history, we can't find any obvious pattern, except that the issue disappeared immediately after rolling back to the unpatched version.
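    In case it helps with reproducing this, here's a minimal sketch of the kind of hook we use to notice dying receivers. It only relies on the standard `StreamingListener` events; the `ReceiverDeathMonitor` name and the plain `println` alerting are made up for illustration, our real setup forwards these events to the external scheduler that restarts the consumer.
    
    ```scala
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerReceiverError, StreamingListenerReceiverStopped}
    
    // Hypothetical listener that just logs receiver failures; in practice the
    // events would be forwarded to whatever restarts the consumer.
    class ReceiverDeathMonitor extends StreamingListener {
      override def onReceiverError(event: StreamingListenerReceiverError): Unit = {
        val info = event.receiverInfo
        println(s"Receiver ${info.name} (stream ${info.streamId}) errored: ${info.lastErrorMessage}")
      }
      override def onReceiverStopped(event: StreamingListenerReceiverStopped): Unit = {
        println(s"Receiver ${event.receiverInfo.name} stopped on ${event.receiverInfo.location}")
      }
    }
    
    // Registered once on the StreamingContext:
    // ssc.addStreamingListener(new ReceiverDeathMonitor())
    ```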
    
    
    Reading some documentation about Spark internals, there should be a [receiver supervisor](https://jaceklaskowski.gitbooks.io/spark-streaming/content/spark-streaming-receiversupervisors.html) in charge of restarting the receiver process; if you run into the same issue in the future (hope not), it might be worth digging into that.
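    For reference, that supervisor-driven restart is what a receiver triggers when it calls `Receiver.restart()`. A rough sketch of where that call sits (the socket source and the `FlakyReceiver` name are just placeholders):
    
    ```scala
    import java.net.Socket
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    
    // Toy receiver: on a connection error it calls restart(), which asks the
    // receiver supervisor to stop and relaunch the receiver asynchronously.
    class FlakyReceiver(host: String, port: Int)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
    
      override def onStart(): Unit = {
        new Thread("FlakyReceiver") {
          override def run(): Unit = receive()
        }.start()
      }
    
      override def onStop(): Unit = { /* reading thread exits when receive() returns */ }
    
      private def receive(): Unit = {
        try {
          val socket = new Socket(host, port)
          scala.io.Source.fromInputStream(socket.getInputStream, "UTF-8")
            .getLines().foreach(line => store(line))
          socket.close()
          restart("Stream ended, restarting receiver")
        } catch {
          case e: java.io.IOException =>
            restart("Connection error, restarting receiver", e)
        }
      }
    }
    ```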

