Hi Pawel,

As far as I know, the application attempt is incremented if the application
master fails and a new one is brought up. Therefore, what you are seeing
should not happen. I have just deployed Flink on AWS EMR 5.17.0 (Hadoop
2.8.4) and killed the container running the application master; the
container id was not reused. Can you describe how to reproduce this
behavior? Do you have a sample application? Can you observe this behavior
consistently? Can you share the complete output of the following command?

    yarn logs -applicationId <YOUR_APPLICATION_ID>
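
If the aggregated logs are large, it may also help to fetch only the AM
container's logs. Depending on your Hadoop version, the -containerId option
of yarn logs may additionally require -nodeAddress:

    yarn logs -applicationId <YOUR_APPLICATION_ID> -containerId <CONTAINER_ID>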

The call to the method setKeepContainersAcrossApplicationAttempts is needed
to enable recovery of previously allocated TaskManager containers [1]. I
currently do not see how it would be possible to keep the AM container
itself across application attempts.
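
For reference, below is a minimal sketch of how this flag is set when
submitting an application through the YARN client API. This is not the
exact Flink code: the class name and "MyApp" are placeholders, and the AM
ContainerLaunchContext setup is omitted for brevity.

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SubmitWithKeepContainers {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx =
                app.getApplicationSubmissionContext();
            ctx.setApplicationName("MyApp"); // placeholder name
            // Allow a second attempt if the first AM fails.
            ctx.setMaxAppAttempts(2);
            // Keep previously allocated containers (the TaskManagers)
            // running when a new application attempt is started, so the
            // new AM can reclaim them instead of requesting new ones.
            ctx.setKeepContainersAcrossApplicationAttempts(true);
            // Setting up the AM ContainerLaunchContext and resources is
            // omitted here for brevity.
            yarnClient.submitApplication(ctx);
        }
    }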

> The second challenge is understanding whether the job will be restored in
> the new application attempt, or whether the new attempt will just have
> Flink running without any job?

The job will be restored if you have HA enabled [2][3].
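
For completeness, a minimal HA setup in flink-conf.yaml could look like the
following (hostnames and paths are placeholders; see [2] for all options):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181
    high-availability.storageDir: hdfs:///flink/ha/
    yarn.application-attempts: 10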

Best,
Gary

[1]
https://hortonworks.com/blog/apache-hadoop-yarn-hdp-2-2-fault-tolerance-features-long-running-services/
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/jobmanager_high_availability.html#yarn-cluster-high-availability
[3]
https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/yarn_setup.html#recovery-behavior-of-flink-on-yarn

On Mon, Oct 8, 2018 at 12:32 PM Pawel Bartoszek <pawelbartosze...@gmail.com>
wrote:

> Hi,
>
> I am looking into why YARN starts a new application attempt on Flink
> 1.5.2. The challenge is getting the logs for the first attempt. After
> checking YARN I discovered that in both the first and the second attempt
> the application master (job manager) gets assigned the same container id
> (is this expected?). In that case, are the logs from the first attempt
> overwritten? I found that *setKeepContainersAcrossApplicationAttempts* is
> enabled here
> <https://github.com/apache/flink/blob/2ec72123e347e684ac40a1e1111a79a11211aadb/flink-yarn/src/main/java/org/apache/flink/yarn/AbstractYarnClusterDescriptor.java#L1340>
>
> The second challenge is understanding whether the job will be restored in
> the new application attempt, or whether the new attempt will just have
> Flink running without any job?
>
>
> Regards,
> Pawel
>
> *First attempt:*
>
> [pawel_bartoszek@ip-10-4-X-X ~]$ yarn container -list appattempt_1538570922803_0020_000001
> 18/10/08 10:16:16 INFO client.RMProxy: Connecting to ResourceManager at
> ip-10-4-X-X.eu-west-1.compute.internal/10.4.108.26:8032
> Total number of containers :1
> Container-Id:      container_1538570922803_0020_02_000001
> Start Time:        Mon Oct 08 09:47:17 +0000 2018
> Finish Time:       N/A
> State:             RUNNING
> Host:              ip-10-4-X-X.eu-west-1.compute.internal:8041
> Node Http Address: http://ip-10-4-X-X.eu-west-1.compute.internal:8042
> LOG-URL:
> http://ip-10-4-X-X.eu-west-1.compute.internal:8042/node/containerlogs/container_1538570922803_0020_02_000001/pawel_bartoszek
>
> *Second attempt:*
> [pawel_bartoszek@ip-10-4-X-X ~]$ yarn container -list appattempt_1538570922803_0020_000002
> 18/10/08 10:16:37 INFO client.RMProxy: Connecting to ResourceManager at
> ip-10-4-X-X.eu-west-1.compute.internal/10.4.X.X:8032
> Total number of containers :1
> Container-Id:      container_1538570922803_0020_02_000001
> Start Time:        Mon Oct 08 09:47:17 +0000 2018
> Finish Time:       N/A
> State:             RUNNING
> Host:              ip-10-4-X-X.eu-west-1.compute.internal:8041
> Node Http Address: http://ip-10-4-X-X.eu-west-1.compute.internal:8042
> LOG-URL:
> http://ip-10-4-X-X.eu-west-1.compute.internal:8042/node/containerlogs/container_1538570922803_0020_02_000001/pawel_bartoszek
>
