Jobmanagers are in a crash loop after upgrade from 1.12.2 to 1.13.1

2021-06-30 Thread Shilpa Shankar
Hello,

We have a Flink session cluster in Kubernetes running on 1.12.2. We
attempted an upgrade to v1.13.1, but the JobManager pods are continuously
restarting and are in a crash loop.

Logs are attached for reference.

How do we recover from this state?

Thanks,
Shilpa
2021-06-30 16:03:25,965 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
org.apache.flink.util.FlinkException: JobMaster for job a1fa9416058026ed7dffeafaf7c21c81 failed.
    at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:873) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
    at org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:459) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
    at org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:436) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$3(Dispatcher.java:415) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
    at java.util.concurrent.CompletableFuture.uniHandle(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$Completion.run(Unknown Source) ~[?:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.actor.Actor.aroundReceive(Actor.scala:517) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.actor.Actor.aroundReceive$(Actor.scala:515) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.12-1.13.1.jar:1.13.1]
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.12-1.13.1.jar:1.13.1]
Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
    at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) ~[?:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
    at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
    at jav

Re: Jobmanagers are in a crash loop after upgrade from 1.12.2 to 1.13.1

2021-06-30 Thread Austin Cawley-Edwards
Hi Shilpa,

Thanks for reaching out to the mailing list and providing those logs! The
NullPointerException looks odd to me, but in order to better guess what's
happening, can you tell me a little bit more about what your setup looks
like? How are you deploying, i.e., standalone with your own manifests, the
Kubernetes integration of the Flink CLI, some open-source operator, etc.?

Also, are you using a High Availability setup for the JobManager?

Best,
Austin


On Wed, Jun 30, 2021 at 12:31 PM Shilpa Shankar wrote:

> Hello,
>
> We have a Flink session cluster in Kubernetes running on 1.12.2. We
> attempted an upgrade to v1.13.1, but the JobManager pods are continuously
> restarting and are in a crash loop.
>
> Logs are attached for reference.
>
> How do we recover from this state?
>
> Thanks,
> Shilpa
>


Re: Jobmanagers are in a crash loop after upgrade from 1.12.2 to 1.13.1

2021-06-30 Thread Zhu Zhu
Hi Shilpa,

JobType was introduced in 1.13, so I guess the cause is that the client
which creates and submits the job is still on 1.12.2. That client generates
an outdated job graph which does not have its JobType set, which results in
this NPE.
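
One quick way to check (just a rough sketch, assuming the jobs are submitted
with the standard Flink CLI from a distribution on the submitting host/pod;
the paths below are placeholders):

  # Confirm which Flink client actually submits the jobs; the dist jar name
  # carries the version, and recent CLIs can also print it directly.
  ls $FLINK_HOME/lib/flink-dist_*.jar   # expect flink-dist_2.12-1.13.1.jar
  $FLINK_HOME/bin/flink --version       # should report 1.13.x, not 1.12.x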

Thanks,
Zhu

On Thu, Jul 1, 2021 at 1:54 AM Austin Cawley-Edwards wrote:

> Hi Shilpa,
>
> Thanks for reaching out to the mailing list and providing those logs! The
> NullPointerException looks odd to me, but in order to better guess what's
> happening, can you tell me a little bit more about what your setup looks
> like? How are you deploying, i.e., standalone with your own manifests, the
> Kubernetes integration of the Flink CLI, some open-source operator, etc.?
>
> Also, are you using a High Availability setup for the JobManager?
>
> Best,
> Austin


Re: Jobmanagers are in a crash loop after upgrade from 1.12.2 to 1.13.1

2021-07-01 Thread Shilpa Shankar
Hi Zhu,

Does this mean our upgrades are going to fail and that the jobs are not
backward compatible?
I did verify that the job itself is built using 1.13.0.

Is there a workaround for this?

Thanks,
Shilpa


On Wed, Jun 30, 2021 at 11:14 PM Zhu Zhu  wrote:

> Hi Shilpa,
>
> JobType was introduced in 1.13, so I guess the cause is that the client
> which creates and submits the job is still on 1.12.2. That client generates
> an outdated job graph which does not have its JobType set, which results in
> this NPE.
>
> Thanks,
> Zhu


Re: Jobmanagers are in a crash loop after upgrade from 1.12.2 to 1.13.1

2021-07-01 Thread Austin Cawley-Edwards
Hi Shilpa,

I've confirmed that "recovered" jobs are not compatible between minor
versions of Flink (e.g., between 1.12 and 1.13). I believe the issue is
that the session cluster was upgraded to 1.13 without first stopping the
jobs running on it.

If this is the case, the workaround is to stop each job on the 1.12 session
cluster with a savepoint, upgrade the session cluster to 1.13, and then
resubmit each job with the desired savepoint.
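
Concretely, with the Flink CLI it would look roughly like this (only a
sketch; the JobManager address, savepoint directory, job IDs, and jar name
are placeholders):

  # 1) While the session cluster is still on 1.12, stop each job with a savepoint
  ./bin/flink list -m <jobmanager-address>:8081
  ./bin/flink stop -m <jobmanager-address>:8081 --savepointPath <savepoint-dir> <job-id>

  # 2) Upgrade the session cluster (images/manifests) to 1.13.1 and let the
  #    JobManagers start with no jobs left to recover

  # 3) Resubmit each job, built against 1.13.x, from its savepoint
  ./bin/flink run -m <jobmanager-address>:8081 -s <savepoint-dir>/<savepoint-id> <your-job>.jar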

Is that the case / does the procedure make sense?

Best,
Austin

On Thu, Jul 1, 2021 at 7:52 AM Shilpa Shankar wrote:

> Hi Zhu,
>
> Does this mean our upgrades are going to fail and that the jobs are not
> backward compatible?
> I did verify that the job itself is built using 1.13.0.
>
> Is there a workaround for this?
>
> Thanks,
> Shilpa