Hi Shilpa,

I've confirmed that "recovered" jobs are not compatible between minor
versions of Flink (e.g., between 1.12 and 1.13). I believe the issue is
that the session cluster was upgraded to 1.13 without first stopping the
jobs running on it.

If this is the case, the workaround is to stop each job on the 1.12 session
cluster with a savepoint, upgrade the session cluster to 1.13, and then
resubmit each job from its savepoint.
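
Roughly, assuming you submit with the Flink CLI (the distribution paths,
savepoint directory, and job ID below are just placeholders, and the exact
flags may differ depending on how you deploy), that would look something like:

  # stop each job with a savepoint, using the 1.12.2 client
  ./flink-1.12.2/bin/flink stop --savepointPath s3://my-bucket/savepoints <job-id>

  # after upgrading the session cluster, resubmit each job from its
  # savepoint, using the 1.13.1 client
  ./flink-1.13.1/bin/flink run --detached \
      --fromSavepoint s3://my-bucket/savepoints/savepoint-<id> my-job.jar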

Does that match your setup, and does the procedure make sense?

Best,
Austin

On Thu, Jul 1, 2021 at 7:52 AM Shilpa Shankar <sshan...@bandwidth.com>
wrote:

> Hi Zhu,
>
> Does this mean our upgrades are going to fail and the jobs are not backward
> compatible?
> I did verify the job itself is built using 1.13.0.
>
> Is there a workaround for this?
>
> Thanks,
> Shilpa
>
>
> On Wed, Jun 30, 2021 at 11:14 PM Zhu Zhu <reed...@gmail.com> wrote:
>
>> Hi Shilpa,
>>
>> JobType was introduced in 1.13, so I guess the cause is that the client
>> which creates and submits the job is still 1.12.2. That client generates an
>> outdated job graph which does not have its JobType set, which results in
>> this NPE.
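>>
>> If that is the cause, submitting the job with the 1.13.1 client instead
>> should avoid it, for example (the path and jar name here are just
>> placeholders):
>>
>>   /opt/flink-1.13.1/bin/flink run --detached my-job.jar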
>>
>> Thanks,
>> Zhu
>>
>> On Thu, Jul 1, 2021 at 1:54 AM Austin Cawley-Edwards <austin.caw...@gmail.com>
>> wrote:
>>
>>> Hi Shilpa,
>>>
>>> Thanks for reaching out to the mailing list and providing those logs!
>>> The NullPointerException looks odd to me, but to make a better guess at
>>> what's happening, could you tell me a bit more about your setup? How are
>>> you deploying, e.g., standalone with your own manifests, the native
>>> Kubernetes integration via the Flink CLI, some open-source operator, etc.?
>>>
>>> Also, are you using a High Availability setup for the JobManager?
>>>
>>> Best,
>>> Austin
>>>
>>>
>>> On Wed, Jun 30, 2021 at 12:31 PM Shilpa Shankar <sshan...@bandwidth.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We have a Flink session cluster in Kubernetes running 1.12.2. We
>>>> attempted an upgrade to 1.13.1, but the JobManager pods are in a crash
>>>> loop, continuously restarting.
>>>>
>>>> Logs are attached for reference.
>>>>
>>>> How do we recover from this state?
>>>>
>>>> Thanks,
>>>> Shilpa
>>>>
>>>
