Re: Issue with single job yarn flink cluster HA

Ken Krugler Wed, 05 Aug 2020 17:22:05 -0700

Hi Dinesh,

Did updating to Flink 1.10 resolve the issue?


Thanks,

— Ken

> Hi Andrey,
> Sure, we will try Flink 1.10 to see whether the HA issues we are facing are 
> fixed, and will update this thread.
> 
> Thanks,
> Dinesh
> 
> On Thu, Apr 2, 2020 at 3:22 PM Andrey Zagrebin <azagre...@apache.org> wrote:
> Hi Dinesh,
> 
> Thanks for sharing the logs. There have been a couple of HA fixes since 1.7, 
> e.g. [1] and [2].
> I would suggest trying Flink 1.10.
> If the problem persists, could you also find the logs of the failed 
> JobManager from before the failover?
> 
> Best,
> Andrey
> 
> [1] https://jira.apache.org/jira/browse/FLINK-14316
> [2] https://jira.apache.org/jira/browse/FLINK-11843
> 
> On Tue, Mar 31, 2020 at 6:49 AM Dinesh J <dineshj...@gmail.com> wrote:
> Hi Yang,
> I am attaching a full jobmanager log for a job which I reran today. This is a 
> job that tries to read from a savepoint.
> The same error message, "leader election ongoing", is displayed, and it stays 
> the same even after 30 minutes. If I leave the job running without a yarn 
> kill, it stays that way forever.
> Based on your suggestions so far, I guess it might be a ZooKeeper 
> problem. If that is the case, what should I look out for in ZooKeeper to 
> figure out the issue?
> 
> Thanks,
> Dinesh


[snip]

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
