Just a hunch, because we’ve been dealing with something similar: when the
failure occurs, has the resource manager also failed over recently, say
within the previous 24 hours?

One thing to try: catch this exception and manually fail over to the new
master/resource manager.
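
To make the idea concrete, here is a minimal sketch of the control flow only,
written in Python for brevity. It is not Flink or YARN code: AuthFailure,
request_with_failover, the endpoint list, and the send callback are all
hypothetical names standing in for whatever client call raises the
"client cannot authenticate" error in your setup.

```python
class AuthFailure(Exception):
    """Hypothetical stand-in for the 'client cannot authenticate' error."""


def request_with_failover(endpoints, send):
    """Try each resource manager endpoint in order.

    `endpoints` is a list of hypothetical endpoint names and `send` is a
    caller-supplied function that performs the request against one endpoint.
    On AuthFailure we move on to the next endpoint instead of giving up.
    Returns a (endpoint, result) pair for the first endpoint that succeeds.
    """
    if not endpoints:
        raise ValueError("no endpoints given")

    last_error = None
    for endpoint in endpoints:
        try:
            return endpoint, send(endpoint)
        except AuthFailure as err:
            # Remember the failure and fall through to the next endpoint.
            last_error = err

    # Every endpoint rejected the request; surface the last failure.
    raise last_error
```

Whether failing over actually clears the error depends on the root cause, so
treat this as a diagnostic experiment rather than a fix.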

- Billy Watson

On Thu, Nov 29, 2018 at 21:16 Paul Lam <paullin3...@gmail.com> wrote:

> Hi,
>
> I’m running Flink applications on YARN 2.6.0-cdh5.6.0 and have run into a
> problem. After running for a while (could be longer than 7 days), the
> application might need to scale up or recover from a node failure, but it
> is not able to allocate new containers. All the incoming containers fail
> to localize resources and create log aggregation dirs for lack of
> credentials, so the Flink application never gets the requested containers.
> It seems that the credentials in the container launch context somehow
> disappear.
>
> This looks very similar to FLINK-6376[1] and YARN-2704[2], but both of
> them should have been fixed. The Flink AM gets the HDFS delegation token
> from the client, puts it into the container launch context, and does not
> refresh it afterwards. But IMHO, if the token had expired, the exception
> should be “token expired” or “token not found in cache”, whereas what I
> get is “client cannot authenticate via [token, kerberos]”.
>
> This happens very randomly, and I have been struggling with it for a
> couple of days. Any help would be greatly appreciated. Thanks a lot!
>
> [1] https://issues.apache.org/jira/browse/FLINK-6376
> [2] https://issues.apache.org/jira/browse/YARN-2704
>
> Best,
> Paul Lam
>
>
> --
William Watson
