You need to figure out why the receivers failed in the first place. Look in your worker logs and see what actually happened. When you run a streaming job continuously for a longer period, there will usually be a lot of logs (you can enable log rotation, etc.), and if you are doing groupBy, join, or similar operations, there will also be a lot of shuffle data. So check the worker logs and see what went wrong (whether the disk filled up, etc.). We have streaming pipelines running for weeks without any issues.
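Since the question below asks how to use StreamingListener for monitoring receiver status, here is a minimal sketch against the Spark 1.3 streaming API. The class name ReceiverMonitor and the println reporting are just illustrations; in practice you would hook this up to your alerting or restart logic.

```scala
import org.apache.spark.streaming.scheduler.{
  StreamingListener, StreamingListenerReceiverError, StreamingListenerReceiverStopped}

// Logs receiver failures so you can react to them (alerting, restarting the
// job, etc.). Register it on the StreamingContext before ssc.start().
class ReceiverMonitor extends StreamingListener {

  override def onReceiverError(err: StreamingListenerReceiverError): Unit = {
    val info = err.receiverInfo
    println(s"Receiver ${info.name} (stream ${info.streamId}) on ${info.location} " +
      s"failed: ${info.lastErrorMessage}")
  }

  override def onReceiverStopped(stopped: StreamingListenerReceiverStopped): Unit = {
    println(s"Receiver ${stopped.receiverInfo.name} stopped")
  }
}

// Usage, given an existing StreamingContext ssc:
//   ssc.addStreamingListener(new ReceiverMonitor)
//   ssc.start()
```

Note that the listener only tells you a receiver failed; it does not restart it for you, so you still need to decide how to react (e.g. stop the context and let your supervisor restart the job).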
Thanks
Best Regards

On Mon, Mar 16, 2015 at 12:40 PM, Jun Yang <yangjun...@gmail.com> wrote:

> Guys,
>
> We have a project which builds upon Spark streaming.
>
> We use Kafka as the input stream, and create 5 receivers.
>
> When this application had run for around 90 hours, all 5 receivers failed
> for some unknown reason.
>
> In my understanding, it is not guaranteed that a Spark streaming receiver
> will do fault recovery automatically.
>
> So I just want to figure out a way of doing fault recovery to deal with
> receiver failure.
>
> There is a JIRA comment that mentions using StreamingListener for
> monitoring the status of receivers:
>
> https://issues.apache.org/jira/browse/SPARK-2381?focusedCommentId=14056836&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14056836
>
> However, I haven't found any open doc about how to do this.
>
> Has anyone met the same issue and dealt with it?
>
> Our environment:
> Spark 1.3.0
> Dual Master Configuration
> Kafka 0.8.2
>
> Thanks
>
> --
> yangjun...@gmail.com
> http://hi.baidu.com/yjpro