Thanks Piotr. This is helpful.

Thomas

On Mon, Jun 28, 2021 at 8:29 AM Piotr Nowojski <pnowoj...@apache.org> wrote:

> Hi,
>
> You should still be able to get the Flink logs via:
>
> > yarn logs -applicationId application_1623861596410_0010
>
> That should tell you more about what happened.
>
> About the Flink and YARN behaviour, have you seen the documentation? [1]
> Especially this part:
>
> > Failed containers (including the JobManager) are replaced by YARN. The
> maximum number of JobManager container restarts is configured via
> yarn.application-attempts (default 1). The YARN Application will fail once
> all attempts are exhausted.
>
> ?
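>
> For example, to let YARN restart the JobManager container a few times
> before giving up on the whole application, you could set something like
> this in flink-conf.yaml (the value 3 is just an illustrative choice, not
> a recommendation):
>
> ```yaml
> # Allow up to 3 ApplicationMaster (JobManager) attempts before YARN
> # fails the application (default is 1).
> yarn.application-attempts: 3
> ```
>
> Note that YARN itself caps this via yarn.resourcemanager.am.max-attempts
> on the cluster side, so the effective limit is the smaller of the two.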
>
> Best,
> Piotrek
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/yarn/#flink-on-yarn-reference
>
> On Mon, Jun 28, 2021 at 02:26 Thomas Wang <w...@datability.io> wrote:
>
>> Just found some additional info. It looks like one of the EC2 instances
>> was terminated at the time of the crash, and this job had 7 Task
>> Managers running on that instance. I now suspect that when Yarn tried to
>> relocate those Task Managers, there were no idle containers available,
>> as this job was using roughly 99% of the entire cluster. However, in
>> that case, shouldn't Yarn wait for containers to become available? I'm
>> not sure how Flink behaves in this situation. Could someone provide some
>> insight here? Thanks.
>>
>> Thomas
>>
>> On Sun, Jun 27, 2021 at 4:24 PM Thomas Wang <w...@datability.io> wrote:
>>
>>> Hi,
>>>
>>> I recently experienced a job crash due to the underlying Yarn
>>> application failing for some reason. Here is the only error message I saw.
>>> It seems I can no longer see any of the Flink job logs.
>>>
>>> Application application_1623861596410_0010 failed 1 times (global limit
>>> =2; local limit is =1) due to ApplicationMaster for attempt
>>> appattempt_1623861596410_0010_000001 timed out. Failing the application.
>>>
>>> I was running the Flink job using the Yarn session mode with the
>>> following command.
>>>
>>> export HADOOP_CLASSPATH=`hadoop classpath` &&
>>> /usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached
>>>
>>> I didn't have HA set up, but I believe the underlying Yarn application
>>> caused the crash, because if the Flink job itself had failed for some
>>> reason, the Yarn application should still have survived. Please correct
>>> me if this assumption is wrong.
>>>
>>> My question is: how should I find the root cause in this case, and
>>> what's the recommended way to avoid this going forward?
>>>
>>> Thanks.
>>>
>>> Thomas
>>>
>>
