> > Is there a way to make the new yarn job run only on the new hardware?
I think you can simply decommission the nodes from YARN, so that new
containers will not be allocated on those nodes. You might also need a large
decommission timeout, after which any containers still running on the
decommissioning nodes will be killed.

Thank you~

Xintong Song

On Fri, Nov 13, 2020 at 2:57 PM Robert Metzger <rmetz...@apache.org> wrote:

> Hi,
> it seems that YARN has a feature for targeting specific hardware:
> https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html
> In any case, you'll need enough spare resources for some time to be able
> to run your job twice for this kind of "zero downtime handover".
>
> On Thu, Nov 12, 2020 at 6:10 PM Rex Fenley <r...@remind101.com> wrote:
>
>> Awesome, thanks! Is there a way to make the new yarn job run only on the
>> new hardware? Or would the two jobs have to run on intersecting hardware
>> and then be switched on/off, which means we'd need a buffer of resources
>> for our orchestration?
>>
>> Also, good point on recovery. I'll spend some time looking into this.
>>
>> Thanks
>>
>> On Wed, Nov 11, 2020 at 11:53 PM Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> Hey Rex,
>>>
>>> the second approach (spinning up a standby job and then doing a
>>> handover) sounds more promising to implement, without rewriting half of
>>> the Flink codebase ;)
>>> What you need is a tool that orchestrates creating a savepoint, starting
>>> a second job from the savepoint, and then communicating with a custom
>>> sink implementation that can be switched on/off in the two jobs.
>>> With that approach, you should have almost no downtime, just increased
>>> resource requirements during such a handover.
>>>
>>> What you also need to consider is that this handover process only works
>>> for scheduled maintenance. If you have a system failure, you'll have
>>> downtime until the last checkpoint is restored.
>>> If you are trying to reduce the potential downtime overall, I would
>>> rather recommend optimizing the checkpoint restore time, as this can
>>> cover both scheduled maintenance and system failures.
>>>
>>> Best,
>>> Robert
>>>
>>> On Wed, Nov 11, 2020 at 8:56 PM Rex Fenley <r...@remind101.com> wrote:
>>>
>>>> Another thought, would it be possible to
>>>> * Spin up new core or task nodes.
>>>> * Run a new copy of the same job on these new nodes from a savepoint.
>>>> * Have the new job *not* write to the sink until the other job is
>>>> torn down?
>>>>
>>>> This would allow us to be eventually consistent and maintain writes
>>>> going through without downtime. As long as whatever is buffering the
>>>> sink doesn't run out of space, it should work just fine.
>>>>
>>>> We're hoping to achieve consistency in less than 30s, ideally.
>>>>
>>>> Again though, if we could get savepoints to restore in less than 30s,
>>>> that would likely be sufficient for our purposes.
>>>>
>>>> Thanks!
>>>>
>>>> On Wed, Nov 11, 2020 at 11:42 AM Rex Fenley <r...@remind101.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm trying to find a solution for auto-scaling our Flink EMR cluster
>>>>> with 0 downtime, using RocksDB as state storage and S3 as the backing
>>>>> store.
>>>>>
>>>>> My current thoughts are like so:
>>>>> * Scaling an operator dynamically would require all keyed state to be
>>>>> available to the set of subtasks for that operator, therefore a set
>>>>> of subtasks must be reading from and writing to the same RocksDB.
>>>>> I.e. to scale the subtasks in that set in and out, they need to read
>>>>> from the same RocksDB.
>>>>> * Since subtasks can run on different core nodes, is it possible to
>>>>> have different core nodes read/write to the same RocksDB?
>>>>> * When is it safe to scale an operator in and out? Only right after a
>>>>> checkpoint, possibly?
>>>>>
>>>>> If the above is not possible, then we'll have to use savepoints,
>>>>> which means some downtime, therefore:
>>>>> * Scaling out during high traffic is arguably more important to react
>>>>> quickly to than scaling in during low traffic. Is it possible to add
>>>>> more core nodes to EMR without disturbing a job? If so, then maybe we
>>>>> can orchestrate running a new job on new nodes and then loading a
>>>>> savepoint from a currently running job.
>>>>>
>>>>> Lastly:
>>>>> * Savepoints for ~70 GiB of state take on the order of minutes to
>>>>> tens of minutes for us to restore from. Is there any way to speed up
>>>>> restoration?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --
>>>>>
>>>>> Rex Fenley | Software Engineer - Mobile and Backend
>>>>>
>>>>> Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/>
>>>>> | FOLLOW US <https://twitter.com/remindhq> | LIKE US
>>>>> <https://www.facebook.com/remindhq>
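
The graceful decommissioning suggested at the top of the thread can be
sketched with the standard YARN admin CLI. The exclude-file path and
hostname below are placeholders (the real path is whatever
`yarn.resourcemanager.nodes.exclude-path` points at on your cluster), and
the timeout value is illustrative:

```shell
# Add the hosts being retired to the ResourceManager's exclude file
# (path and hostname are placeholders for this sketch).
echo "old-node-1.example.com" >> /etc/hadoop/conf/yarn.exclude

# Ask the RM to gracefully decommission the excluded nodes: no new
# containers are placed on them, and containers still running when the
# timeout (in seconds) expires are killed.
yarn rmadmin -refreshNodes -g 3600 -client
```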
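The savepoint-based handover Robert describes could be orchestrated with
the stock Flink CLI roughly as follows. Job IDs, the jar name, and the S3
paths are placeholders, and the sink on/off switch is application-specific
(Flink ships no such toggle):

```shell
# 1. Take a savepoint of the currently running job.
flink savepoint <old-job-id> s3://my-bucket/savepoints/

# 2. Start the standby job from that savepoint, detached, with its
#    custom sink initially disabled (application-specific).
flink run -d -s s3://my-bucket/savepoints/savepoint-<id> standby-job.jar

# 3. Once the standby job has caught up: switch the old job's sink off
#    and the new job's sink on (application-specific), then cancel.
flink cancel <old-job-id>
```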
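On the last question (speeding up restores), a sketch of configuration
options commonly suggested for this, written as a flink-conf.yaml fragment.
Option names should be verified against your Flink version's documentation,
and note that task-local recovery only helps recovery from failures, not
restores from a savepoint:

```shell
cat >> flink-conf.yaml <<'EOF'
state.backend: rocksdb
# Incremental checkpoints upload only changed SST files.
state.backend.incremental: true
# Keep a local copy of state so failure recovery can skip the S3
# download; does not apply to savepoint restores.
state.backend.local-recovery: true
EOF
```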