Hi Pawel,

As far as I know, the application attempt is incremented if the application
master fails and a new one is brought up. Therefore, what you are seeing
should not happen. I have just deployed on AWS EMR 5.17.0 (Hadoop 2.8.4) and
killed the container running the application master – the container id was
not reused.

Can you describe how to reproduce this behavior? Do you have a sample
application? Can you observe this behavior consistently? Can you share the
complete output of

    yarn logs -applicationId <YOUR_APPLICATION_ID>

The call to the method setKeepContainersAcrossApplicationAttempts is needed
to enable recovery of previously allocated TaskManager containers [1]. I
currently do not see how it is possible to keep the AM container across
application attempts.

> The second challenge is understanding if the job will be restored into the
> new application attempt, or whether the new application attempt will just
> have Flink running without any job?

The job will be restored if you have HA enabled [2][3].

Best,
Gary

[1] https://hortonworks.com/blog/apache-hadoop-yarn-hdp-2-2-fault-tolerance-features-long-running-services/
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/jobmanager_high_availability.html#yarn-cluster-high-availability
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/yarn_setup.html#recovery-behavior-of-flink-on-yarn

On Mon, Oct 8, 2018 at 12:32 PM Pawel Bartoszek <pawelbartosze...@gmail.com> wrote:
> Hi,
>
> I am looking into why YARN starts a new application attempt on Flink
> 1.5.2. The challenge is getting the logs for the first attempt. After
> checking YARN I discovered that in the first attempt and the second one the
> application master (job manager) gets assigned the same container id (is
> this expected?). In that case, are the logs from the first attempt
> overwritten? I found that setKeepContainersAcrossApplicationAttempts is
> enabled here:
> https://github.com/apache/flink/blob/2ec72123e347e684ac40a1e1111a79a11211aadb/flink-yarn/src/main/java/org/apache/flink/yarn/AbstractYarnClusterDescriptor.java#L1340
>
> The second challenge is understanding if the job will be restored into the
> new application attempt, or whether the new application attempt will just
> have Flink running without any job?
>
> Regards,
> Pawel
>
> First attempt:
>
> [pawel_bartoszek@ip-10-4-X-X ~]$ yarn container -list appattempt_1538570922803_0020_000001
> 18/10/08 10:16:16 INFO client.RMProxy: Connecting to ResourceManager at ip-10-4-X-X.eu-west-1.compute.internal/10.4.108.26:8032
> Total number of containers :1
> Container-Id: container_1538570922803_0020_02_000001
> Start Time: Mon Oct 08 09:47:17 +0000 2018
> Finish Time: N/A
> State: RUNNING
> Host: ip-10-4-X-X.eu-west-1.compute.internal:8041
> Node Http Address: http://ip-10-4-X-X.eu-west-1.compute.internal:8042
> LOG-URL: http://ip-10-4-X-X.eu-west-1.compute.internal:8042/node/containerlogs/container_1538570922803_0020_02_000001/pawel_bartoszek
>
> Second attempt:
>
> [pawel_bartoszek@ip-10-4-X-X ~]$ yarn container -list appattempt_1538570922803_0020_000002
> 18/10/08 10:16:37 INFO client.RMProxy: Connecting to ResourceManager at ip-10-4-X-X.eu-west-1.compute.internal/10.4.X.X:8032
> Total number of containers :1
> Container-Id: container_1538570922803_0020_02_000001
> Start Time: Mon Oct 08 09:47:17 +0000 2018
> Finish Time: N/A
> State: RUNNING
> Host: ip-10-4-X-X.eu-west-1.compute.internal:8041
> Node Http Address: http://ip-10-4-X-X.eu-west-1.compute.internal:8042
> LOG-URL: http://ip-10-4-X-X.eu-west-1.compute.internal:8042/node/containerlogs/container_1538570922803_0020_02_000001/pawel_bartoszek
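
P.S. In case it is useful for the second question: the HA setup described in
[2] comes down to a few flink-conf.yaml entries. A minimal sketch for
ZooKeeper-based HA on YARN follows; the ZooKeeper quorum address and the
storage directory below are placeholders, not values from your cluster, so
adjust them to your environment:

```yaml
# Enable ZooKeeper-based HA so a new application attempt can recover
# the JobManager state and restore the running jobs.
high-availability: zookeeper

# Placeholder quorum address – replace with your ZooKeeper hosts.
high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181

# Placeholder path – durable storage for JobManager metadata.
high-availability.storageDir: hdfs:///flink/recovery

# Allow YARN to start more than one application attempt; without this,
# the application fails permanently after the first AM failure.
yarn.application-attempts: 10
```

With this in place, a new application attempt should pick up the job from
the metadata in the storage directory rather than starting an empty Flink
cluster.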