Re: Question about Spark Streaming Receiver Failure
Dibyendu,

Thanks for the reply. I am reading your project homepage now. One quick question I care about: if the receivers fail for some reason (for example, they are killed forcefully by another process), is there any mechanism for the receiver to fail over automatically?

On Mon, Mar 16, 2015 at 3:25 PM, Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com> wrote:

> Which version of Spark are you running? You can try this Low Level Consumer:
> http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
>
> It is designed to recover from various failures and has a good fault-recovery mechanism built in. It is used by many users, and at Pearson we have been running this receiver in production for almost 3 months without any issue. You can give it a try.
>
> Regards,
> Dibyendu
>
> On Mon, Mar 16, 2015 at 12:47 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> You need to figure out why the receivers failed in the first place. Look in your worker logs and see what really happened. When you run a streaming job continuously for a long period there will usually be a lot of logs (you can enable log rotation, etc.), and if you are doing groupBy, join, etc. operations there will be a lot of shuffle data. So check the worker logs and see what happened (whether the disk filled up, etc.). We have streaming pipelines running for weeks without any issues.
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Mar 16, 2015 at 12:40 PM, Jun Yang <yangjun...@gmail.com> wrote:
>>
>>> Guys,
>>>
>>> We have a project that builds upon Spark Streaming. We use Kafka as the input stream and create 5 receivers. After the application had run for around 90 hours, all 5 receivers failed for unknown reasons.
>>>
>>> In my understanding, Spark Streaming receivers are not guaranteed to recover from faults automatically, so I want to figure out a way to detect receiver failure and fail over.
>>>
>>> There is a JIRA comment that mentions using StreamingListener to monitor the status of a receiver:
>>> https://issues.apache.org/jira/browse/SPARK-2381?focusedCommentId=14056836&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14056836
>>> However, I haven't found any documentation on how to do this. Has anyone met the same issue and dealt with it?
>>>
>>> Our environment:
>>> Spark 1.3.0, dual-master configuration
>>> Kafka 0.8.2
>>>
>>> Thanks
>>>
>>> --
>>> yangjun...@gmail.com
>>> http://hi.baidu.com/yjpro

--
yangjun...@gmail.com
http://hi.baidu.com/yjpro
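For context on the StreamingListener approach mentioned above: Spark's StreamingListener exposes receiver lifecycle callbacks (such as onReceiverError and onReceiverStopped) that a driver program can register to observe receiver health. The sketch below is not Spark's actual API; it is a minimal, self-contained illustration of the same callback pattern, with hypothetical names (ReceiverListener, ReceiverMonitor) chosen for this example only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical mirror of the StreamingListener idea: receiver lifecycle
// events are pushed to registered listeners, which can then alert an
// operator or trigger a restart. These names are illustrative, not Spark's.
interface ReceiverListener {
    void onReceiverError(String receiverId, String message);
    void onReceiverStopped(String receiverId);
}

class ReceiverMonitor {
    private final List<ReceiverListener> listeners = new ArrayList<ReceiverListener>();
    final AtomicInteger errorCount = new AtomicInteger();

    void addListener(ReceiverListener l) {
        listeners.add(l);
    }

    // Called by supervising code when a receiver reports an error;
    // every registered listener is notified.
    void reportError(String receiverId, String message) {
        errorCount.incrementAndGet();
        for (ReceiverListener l : listeners) {
            l.onReceiverError(receiverId, message);
        }
    }
}
```

In real code the listener body would page an operator or kick off recovery; the pattern itself is just callback registration plus fan-out on each event.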
Re: Question about Spark Streaming Receiver Failure
Akhil,

I have checked the logs. There isn't any clue as to why the 5 receivers failed. That's why I took it for granted that receiver failure is a common issue, and that we need a way to detect this kind of failure and fail over.

Thanks

On Mon, Mar 16, 2015 at 3:17 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> [...]

--
yangjun...@gmail.com
http://hi.baidu.com/yjpro
Re: Question about Spark Streaming Receiver Failure
You need to figure out why the receivers failed in the first place. Look in your worker logs and see what really happened. When you run a streaming job continuously for a long period there will usually be a lot of logs (you can enable log rotation, etc.), and if you are doing groupBy, join, etc. operations there will be a lot of shuffle data. So check the worker logs and see what happened (whether the disk filled up, etc.). We have streaming pipelines running for weeks without any issues.

Thanks
Best Regards

On Mon, Mar 16, 2015 at 12:40 PM, Jun Yang <yangjun...@gmail.com> wrote:

> [...]
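The log-rotation advice above usually comes down to the worker's log4j configuration. A typical conf/log4j.properties fragment for log4j 1.x (the version Spark 1.3 ships with) might look like the following; the file path, size limit, and backup count are illustrative placeholders, not recommendations:

```properties
# Rolling file appender for worker logs (illustrative values; adjust to taste)
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/worker.log
log4j.appender.rolling.MaxFileSize=100MB
log4j.appender.rolling.MaxBackupIndex=10
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
```

With a cap like this, a long-running streaming job keeps at most MaxBackupIndex rotated files of MaxFileSize each, so the disk-full failure mode Akhil describes is much less likely.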
Re: Question about Spark Streaming Receiver Failure
Which version of Spark are you running? You can try this Low Level Consumer:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer

It is designed to recover from various failures and has a good fault-recovery mechanism built in. It is used by many users, and at Pearson we have been running this receiver in production for almost 3 months without any issue. You can give it a try.

Regards,
Dibyendu

On Mon, Mar 16, 2015 at 12:47 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> [...]
Re: Question about Spark Streaming Receiver Failure
Yes, auto-restart is enabled in my low-level consumer when an unhandled exception occurs. If you look at KafkaConsumer.java, for some cases (like broker failure, Kafka leader change, etc.) it can even refresh the consumer (the coordinator that talks to a leader), which recovers from those failures.

Dib

On Mon, Mar 16, 2015 at 1:40 PM, Jun Yang <yangjun...@gmail.com> wrote:

> [...]
Re: Question about Spark Streaming Receiver Failure
I have checked Dibyendu's code, and it looks like his implementation has an auto-restart mechanism, in src/main/java/consumer/kafka/client/KafkaReceiver.java:

  private void start() {
    // Start the thread that receives data over a connection
    KafkaConfig kafkaConfig = new KafkaConfig(_props);
    ZkState zkState = new ZkState(kafkaConfig);
    _kConsumer = new KafkaConsumer(kafkaConfig, zkState, this);
    _kConsumer.open(_partitionId);

    Thread.UncaughtExceptionHandler eh = new Thread.UncaughtExceptionHandler() {
      public void uncaughtException(Thread th, Throwable ex) {
        restart("Restarting Receiver for Partition " + _partitionId, ex, 5000);
      }
    };

    _consumerThread = new Thread(_kConsumer);
    _consumerThread.setDaemon(true);
    _consumerThread.setUncaughtExceptionHandler(eh);
    _consumerThread.start();
  }

I also checked Spark's native Kafka receiver implementation, and it does not appear to have any auto-restart support. Any comments from Dibyendu?

On Mon, Mar 16, 2015 at 3:39 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> From what I have seen, once I kill a receiver on one machine, it will automatically spawn another receiver on another machine or on the same machine.
>
> Thanks
> Best Regards
>
> On Mon, Mar 16, 2015 at 1:08 PM, Jun Yang <yangjun...@gmail.com> wrote:
>
>> [...]

--
yangjun...@gmail.com
http://hi.baidu.com/yjpro
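The restart trick in the KafkaReceiver snippet above — a daemon worker thread supervised through a Thread.UncaughtExceptionHandler — can be demonstrated in isolation. The self-contained sketch below (class and field names are invented for this example) simulates a worker that crashes once and is re-launched by its handler, standing in for Spark's restart(message, ex, delay) call:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Demo of the auto-restart pattern: if the worker thread dies with an
// unhandled exception, the UncaughtExceptionHandler re-launches the work,
// mirroring restart("Restarting Receiver for Partition ...", ex, 5000).
public class SupervisedWorker {
    static final AtomicInteger starts = new AtomicInteger();
    static final CountDownLatch recovered = new CountDownLatch(2);

    static void startWorker(final boolean failOnce) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                starts.incrementAndGet();
                recovered.countDown();
                if (failOnce && starts.get() == 1) {
                    // First run dies, simulating a receiver crash.
                    throw new RuntimeException("simulated receiver crash");
                }
            }
        });
        // Supervisor: on an unhandled exception, start a replacement worker.
        t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread th, Throwable ex) {
                startWorker(false); // re-launch without the simulated failure
            }
        });
        t.setDaemon(true);
        t.start();
    }
}
```

The real consumer additionally delays the restart (the 5000 ms argument) and cleans up Kafka/ZooKeeper state before re-opening the partition; this sketch only shows the supervision mechanism itself.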