Re: long lived standalone job session cluster in kubernetes
Hi Heath,

I think some of the PRs are already open and ready for review [1, 2].

[1] https://issues.apache.org/jira/browse/FLINK-10932
[2] https://issues.apache.org/jira/browse/FLINK-10935

Cheers,
Till
Re: long lived standalone job session cluster in kubernetes
Great, my team is eager to get started. I’m curious what progress has been made so far?

-H
Re: long lived standalone job session cluster in kubernetes
Hi Heath and Till, thanks for offering help on reviewing this feature. I just reassigned the JIRAs to myself after offline discussion with Jin. Let us work together to get Kubernetes integrated natively with Flink. Thanks.
Re: long lived standalone job session cluster in kubernetes
Alright, I'll get back to you once the PRs are open. Thanks a lot for your help :-)

Cheers,
Till
Re: long lived standalone job session cluster in kubernetes
My team and I are keen to help out with testing and review as soon as there is a pull request.

-H
Re: long lived standalone job session cluster in kubernetes
Hi Heath,

I just learned that people from Alibaba already made some good progress with FLINK-9953. I'm currently talking to them in order to see how we can merge this contribution into Flink as fast as possible. Since I'm quite busy due to the upcoming release, I hope that other community members will help out with the reviewing once the PRs are opened.

Cheers,
Till
Re: long lived standalone job session cluster in kubernetes
Has any progress been made on this? There are a number of folks in the community looking to help out.

-H
Re: long lived standalone job session cluster in kubernetes
Hi Derek,

there is this issue [1] which tracks the active Kubernetes integration. Jin Sun already started implementing some parts of it. There should also be some PRs open for it. Please check them out.

[1] https://issues.apache.org/jira/browse/FLINK-9953

Cheers,
Till

On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee wrote:
> Sounds good.
>
> Is someone working on this automation today?
>
> If not, although my time is tight, I may be able to work on a PR for
> getting us started down the path of Kubernetes native cluster mode.
>
> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>> Hi Derek,
>>
>> what I would recommend is to trigger the cancel-with-savepoint command
>> [1]. This will create a savepoint and terminate the job execution.
>> Next you simply need to respawn the job cluster, which you provide
>> with the savepoint to resume from.
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>>
>> Cheers,
>> Till
>>
>> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin wrote:
>>> Hi Derek,
>>>
>>> I think your automation steps look good. Recreating deployments
>>> should not take long, and as you mention, this way you can avoid
>>> unpredictable old/new version collisions.
>>>
>>> Best,
>>> Andrey
>>>
>>> On 4 Dec 2018, at 10:22, Dawid Wysakowicz wrote:
>>>> Hi Derek,
>>>>
>>>> I am not an expert in Kubernetes, so I will cc Till, who should be
>>>> able to help you more.
>>>>
>>>> As for automating a similar process, I would recommend having a look
>>>> at the dA platform [1], which is built on top of Kubernetes.
>>>>
>>>> Best,
>>>>
>>>> Dawid
>>>>
>>>> [1] https://data-artisans.com/platform-overview
>>>>
>>>> On 30/11/2018 02:10, Derek VerLee wrote:
>>>>> I'm looking at the job cluster mode; it looks great, and I am
>>>>> considering migrating our jobs off our "legacy" session cluster and
>>>>> into Kubernetes.
>>>>>
>>>>> I do need to ask some questions, because I haven't found a lot of
>>>>> details in the documentation about how it works yet, and I gave up
>>>>> following the DI around in the code after a while.
>>>>>
>>>>> Let's say I have a deployment for the job "leader" in HA with ZK,
>>>>> and another deployment for the taskmanagers.
>>>>>
>>>>> I want to upgrade the code or configuration and start from a
>>>>> savepoint, in an automated way.
>>>>>
>>>>> Best I can figure, I can not just update the deployment resources
>>>>> in Kubernetes and allow the containers to restart in an arbitrary
>>>>> order.
>>>>>
>>>>> Instead, I expect sequencing is important, something along these
>>>>> lines:
>>>>>
>>>>> 1. issue savepoint command on leader
>>>>> 2. wait for savepoint
>>>>> 3. destroy all leader and taskmanager containers
>>>>> 4. deploy new leader, with savepoint URL
>>>>> 5. deploy new taskmanagers
>>>>>
>>>>> For example, I imagine old taskmanagers (with an old version of my
>>>>> job) attaching to the new leader and causing a problem.
>>>>>
>>>>> Does that sound right, or am I overthinking it?
>>>>>
>>>>> If not, has anyone tried implementing any automation for this yet?
Re: long lived standalone job session cluster in kubernetes
Sounds good.

Is someone working on this automation today?

If not, although my time is tight, I may be able to work on a PR to get us started down the path of Kubernetes native cluster mode.
Re: long lived standalone job session cluster in kubernetes
Hi Derek,

what I would recommend is to trigger the cancel with savepoint command [1]. This will create a savepoint and terminate the job execution. Then you simply need to respawn the job cluster, providing it with the savepoint to resume from.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint

Cheers,
Till
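The cancel-with-savepoint step can be driven from a script by capturing the savepoint path the CLI reports. A minimal sketch follows; the job id, savepoint directory, and the exact wording of the CLI's output line are assumptions for illustration, not values from this thread.

```shell
# Hypothetical sketch: cancel a job with a savepoint and capture the
# savepoint path. JOB_ID and SAVEPOINT_DIR are placeholders.
JOB_ID="a1b2c3d4e5f6"                      # placeholder job id
SAVEPOINT_DIR="s3://my-bucket/savepoints"  # placeholder target directory

# `flink cancel -s <targetDir> <jobId>` blocks until the savepoint completes
# and reports where it was stored; grab the last token of the final line and
# strip the trailing period.
extract_savepoint_path() {
  awk '{ p = $NF } END { sub(/\.$/, "", p); print p }'
}

# In a real run:
#   SAVEPOINT_PATH=$(flink cancel -s "$SAVEPOINT_DIR" "$JOB_ID" | extract_savepoint_path)
# Simulated here with the kind of line the CLI prints:
SAVEPOINT_PATH=$(echo "Cancelled job $JOB_ID. Savepoint stored in $SAVEPOINT_DIR/savepoint-a1b2c3-0123456789ab." \
  | extract_savepoint_path)
echo "$SAVEPOINT_PATH"
```

The captured path can then be passed straight to the new job cluster's entrypoint when it is respawned.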
Re: long lived standalone job session cluster in kubernetes
Hi Derek,

I think your automation steps look good. Recreating deployments should not take long, and as you mention, this way you can avoid unpredictable old/new version collisions.

Best,
Andrey
Re: long lived standalone job session cluster in kubernetes
Hi Derek,

I am not an expert in Kubernetes, so I will cc Till, who should be able to help you more.

As for automating a similar process, I would recommend having a look at the dA platform [1], which is built on top of Kubernetes.

Best,
Dawid

[1] https://data-artisans.com/platform-overview
long lived standalone job session cluster in kubernetes
I'm looking at the job cluster mode; it looks great, and I am considering migrating our jobs off our "legacy" session cluster and into Kubernetes.

I do need to ask some questions because I haven't found a lot of details in the documentation about how it works yet, and I gave up following the DI around in the code after a while.

Let's say I have a deployment for the job "leader" in HA with ZK, and another deployment for the taskmanagers.

I want to upgrade the code or configuration and start from a savepoint, in an automated way.

Best I can figure, I cannot just update the deployment resources in Kubernetes and allow the containers to restart in an arbitrary order. Instead, I expect sequencing is important, something along the lines of this:

1. issue savepoint command on leader
2. wait for savepoint
3. destroy all leader and taskmanager containers
4. deploy new leader, with savepoint url
5. deploy new taskmanagers

For example, I imagine old taskmanagers (with an old version of my job) attaching to the new leader and causing a problem.

Does that sound right, or am I overthinking it?

If not, has anyone tried implementing any automation for this yet?
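The five-step sequence above can be sketched as a script driven by the Flink CLI and kubectl. This is a hedged outline, not a tested implementation: every name here (deployment names, label selector, manifest files, job class) is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the five upgrade steps. All resource and file
# names are placeholders.
set -euo pipefail

# Build the job-cluster entrypoint args that resume from a given savepoint;
# this string would be templated into the new JobManager manifest (step 4).
render_jobmanager_args() {
  echo "job-cluster --job-classname com.example.MyJob --fromSavepoint $1"
}

upgrade() {
  local job_id="$1" savepoint_dir="$2" savepoint_path
  # Steps 1-2: cancel with savepoint; the CLI blocks until the savepoint
  # completes and prints its path as the last token of the final line.
  savepoint_path=$(flink cancel -s "$savepoint_dir" "$job_id" \
    | awk '{ p = $NF } END { sub(/\.$/, "", p); print p }')
  # Step 3: tear down old leader and taskmanagers together, so stale TMs
  # running the old job version can never attach to the new leader.
  kubectl delete deployment flink-jobmanager flink-taskmanager
  kubectl wait --for=delete pod -l app=flink --timeout=120s
  # Step 4: deploy the new leader, its manifest templated with the args above.
  render_jobmanager_args "$savepoint_path"   # would feed a manifest template
  kubectl apply -f jobmanager-deployment.yaml
  # Step 5: deploy the new taskmanagers once the leader is up.
  kubectl rollout status deployment/flink-jobmanager --timeout=120s
  kubectl apply -f taskmanager-deployment.yaml
}
```

Deleting both deployments before applying either new one is the key ordering choice: it rules out the old-taskmanager-meets-new-leader collision raised above.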