FYI,

I think I have gotten to the bottom of this situation. For anyone who might be in
the same situation, hopefully my observations will help.

In my case, it had nothing to do with the Flink restart strategy; it was doing its
thing as expected. The real issue was the Kafka producer timeout counters. As I
mentioned in the other thread, we have a capacity issue with our Kafka cluster that
ends up causing timeouts in our Flink applications (we do have throttling in
place in Kafka to manage it better, but unfortunately we still run into timeouts
pretty often right now).

We had set our Kafka producer retries to 10. It seems that the retry counter
never gets reset, so if an app hits 10 timeouts over its lifetime, it basically
cannot start again and goes to a Failed state. I have yet to dig into whether this
can be solved from the Flink Kafka wrapper or not. For now, we have set
retries to 0, and hopefully this situation will not happen again.
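
For reference, a minimal sketch of the producer setting described above. The
`retries` key is the standard Kafka producer configuration name; the class
name, method name, and broker address are placeholders for illustration only
and are not taken from the actual application. The resulting Properties
object would then be handed to whichever Flink Kafka producer connector the
job uses.

```java
import java.util.Properties;

public class ProducerConfigSketch {

    // Builds the Kafka producer settings discussed above.
    // The broker address is a placeholder (assumption), not a real endpoint.
    static Properties producerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Previously "10"; set to "0" so the producer does not retry
        // failed sends itself.
        props.setProperty("retries", "0");
        return props;
    }

    public static void main(String[] args) {
        // Print the effective settings for a quick sanity check.
        System.out.println(producerProps());
    }
}
```

Note that with retries at 0, a producer timeout surfaces as a failure right
away instead of being retried inside the producer, so recovery depends on the
Flink restart strategy.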

If anyone has made similar observations, please feel free to share.

Thanks, Ashish

> On Jan 19, 2018, at 2:43 PM, ashish pok <ashish...@yahoo.com> wrote:
> 
> Team,
> 
> Hopefully, this is a quick one. 
> 
> We have set up the restart strategy as follows in pretty much all of our apps:
> 
>     env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));
> 
> This seems pretty straightforward. The app should retry starting 10 times,
> 30 seconds apart, so for about 5 minutes. Either we are not understanding this or it
> behaves inconsistently. Some of the applications restart and come back fine on
> issues like Kafka timeouts (which I will come back to later), but in some cases
> the same issues pretty much shut the app down.
> 
> My first guess here was that the total count of 10 is not reset after the app
> has recovered normally. Is there a need to manually reset the counter in an app?
> I doubt Flink would treat it as a counter that spans the life of an
> app instead of resetting it on successful start-up, but I am not sure how else to
> explain the behavior.
> 
> Along the same lines, what actually counts as a "restart"? Our Kafka
> cluster has known performance bottlenecks during certain times of day that we
> are working to resolve. I notice Kafka producer timeouts quite a few times
> during these periods. When the app hits these timeouts, it does recover fine, but I
> don't necessarily see the entire application restarting, as I don't see the bootstrap logs
> of my app. Does something like this count as a restart from the restart
> strategy's perspective as well, as opposed to things like app crashes or YARN killing the
> application, where the app is actually restarted from scratch?
> 
> We really like Flink; we just need to hash out these operational issues to
> make it ready for prime time for all the streaming apps we have in our cluster.
> 
> Thanks,
> 
> Ashish
