Hi Gary, Thanks for the help.
Gary Yao-3 wrote
> You are writing that it takes YARN 10 minutes to restart the application
> master (AM). However, in my experiments the AM container is restarted
> within a few seconds after killing the process. If in your setup YARN
> actually needs 10 minutes to restart the AM, then you could try increasing
> the number of retry attempts by the client [2].

I think that comes from the difference in how we tested. When I killed the JM process (using kill -9 pid), a new process was created within a few seconds. However, when I tested by crashing the server (using init 0), it took much longer. I found the yarn-site parameter controlling that timer: yarn.am.liveness-monitor.expiry-interval-ms, which defaults to 10 minutes [1].

I increased the REST client configuration as you suggested, and it did help.

Gary Yao-3 wrote
> The REST API that is queried by the Web UI returns the root cause from the
> ExecutionGraph [3]. All job status transitions should be logged together
> with the exception that caused the transition [4]. Check for INFO level log
> messages that start with "Job [...] switched from state" followed by a
> stacktrace. If you cannot find the exception, the problem might be rooted
> in your log4j or logback configuration.

Thanks, I got the point. I am using logback. I tried to configure rolling logs, but have not succeeded yet; I will need to experiment more.

Thanks and regards,
Averell

[1] https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml#yarn.am.liveness-monitor.expiry-interval-ms

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
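PS: for anyone else hitting the slow-failover case, the expiry interval can be lowered in yarn-site.xml so YARN declares the AM dead sooner after a whole-node crash. A sketch — the 2-minute value here is just an example, not a recommendation, and shortening it too much risks false positives on a loaded cluster:

```xml
<!-- yarn-site.xml: how long the RM waits without an AM heartbeat
     before considering the AM dead (default 600000 ms = 10 minutes) -->
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <value>120000</value> <!-- example: 2 minutes -->
</property>
```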
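The client-side retry settings Gary pointed to [2] can be raised in flink-conf.yaml. A sketch, assuming the rest.retry.* keys from Flink's RestOptions; the values below are illustrative only (the products of attempts x delay should comfortably exceed the AM recovery time):

```yaml
# flink-conf.yaml: keep the client retrying long enough
# for YARN to notice the dead AM and restart it
rest.retry.max-attempts: 40    # default is 20
rest.retry.delay: 5000         # ms between attempts, default 3000
```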
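On the rolling logs: a minimal logback sketch that swaps Flink's default FileAppender for a RollingFileAppender with time-based rotation. It assumes Flink's ${log.file} property is set (Flink passes it on the JVM command line); the daily rotation and 7-day retention are placeholder choices:

```xml
<!-- logback.xml: roll the Flink log daily, keep 7 days of history -->
<configuration>
  <appender name="file" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${log.file}</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>${log.file}.%d{yyyy-MM-dd}</fileNamePattern>
      <maxHistory>7</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss,SSS} %-5level %logger{60} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="file"/>
  </root>
</configuration>
```

One caveat: the YARN web UI / log aggregation may only show the file named exactly ${log.file}, so rolled-over files might not appear there even when rotation itself works.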