Hi,

Savepoints internally work in exactly the same way as checkpoints.
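To make the state-redistribution part concrete: keyed state is organized into key groups, and each subtask owns a contiguous range of them, which is what makes moving state around on restore cheap. Below is a simplified Python sketch of that mapping, for illustration only; Flink's actual logic additionally applies a murmur hash to the key's hashCode (see KeyGroupRangeAssignment), so the exact numbers here are not what Flink would compute.

```python
def key_group_for(key_hash: int, max_parallelism: int) -> int:
    # Every key maps to one of max_parallelism key groups.
    # (Flink additionally murmur-hashes key.hashCode(); simplified here.)
    return key_hash % max_parallelism


def subtask_for_key_group(key_group: int, max_parallelism: int,
                          parallelism: int) -> int:
    # Each subtask owns a contiguous range of key groups, so changing
    # the parallelism only moves whole ranges between subtasks.
    return key_group * parallelism // max_parallelism


# With max parallelism 128, rescaling from 4 to 8 subtasks splits each
# subtask's range of 32 key groups into two ranges of 16.
kg = key_group_for(1000, 128)                      # key group 104
before = subtask_for_key_group(kg, 128, 4)         # subtask 3 of 4
after = subtask_for_key_group(kg, 128, 8)          # subtask 6 of 8
```

The point of the indirection is that on a rescale nothing is repartitioned key by key; subtasks just pick up different key-group ranges from the snapshot.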
I'm not sure what you are referring to as a repartitioning step. If you mean
rescaling (changing the parallelism), then this can happen for both
checkpoints and savepoints. You can find answers to a lot of such questions
with a quick search; about rescaling you can read here [1].

Piotrek

[1] https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html

On Mon, 18 Jan 2021 at 17:58, Rex Fenley <r...@remind101.com> wrote:

> This is great information, thank you. It does look like task-local
> recovery only works for checkpoints; however, we do want to bring down
> our recovery times, so this is useful.
>
> I'm also wondering what is involved with savepoints, though. Savepoint
> restoration must have some repartitioning step too, I'd imagine, which
> seems like it could be fairly involved? Anything else I'm missing?
>
> Thanks!
>
> On Mon, Jan 18, 2021 at 2:49 AM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
>> Hi Rex,
>>
>> Good that you have found the source of your problem, and thanks for
>> reporting it back.
>>
>> Regarding your question about the recovery steps (ignoring scheduling
>> and deploying): I think it depends on the state backend used. From your
>> other emails I see you are using RocksDB, so I believe this is the big
>> picture of how it works in the RocksDB case:
>>
>> 1. Relevant state files are loaded from the DFS to the local disks of a
>> TaskManager [1].
>> 2. I presume RocksDB needs to load at least some metadata from those
>> local files in order to finish the recovery process (I doubt it, but
>> maybe it also needs to load some actual data).
>> 3. Tasks can start processing records.
>>
>> Best,
>> Piotrek
>>
>> [1] This step can be avoided if you are using local recovery:
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery
>>
>> On Sat, 16 Jan 2021 at 06:15, Rex Fenley <r...@remind101.com> wrote:
>>
>>> After looking through lots of graphs and AWS limits,
>>> I've come to the conclusion that we're hitting limits on our disk
>>> writes. I'm guessing this is creating backpressure against the entire
>>> restore process. I'm still very curious about all the steps involved
>>> in savepoint restoration, though!
>>>
>>> On Fri, Jan 15, 2021 at 7:50 PM Rex Fenley <r...@remind101.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We have a savepoint that's ~0.5 TiB in size. When we try to restore
>>>> from it, we time out because it takes too long (right now, checkpoint
>>>> timeouts are set to 2 hours, which is way above where we want them
>>>> already).
>>>>
>>>> I'm curious whether it needs to download the entire savepoint to
>>>> continue. Or, for further education, what are all the operations that
>>>> take place before a job is restored from a savepoint?
>>>>
>>>> Additionally, the network seems to be a big bottleneck. Our network
>>>> should be operating in the GiB/s range per instance, but it seems to
>>>> operate between 70-100 MiB per second when retrieving a savepoint.
>>>> Are there any constraining factors in Flink's design that would slow
>>>> down the network download of a savepoint this much (from S3)?
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>>
>>>> Rex Fenley | Software Engineer - Mobile and Backend
>>>>
>>>> Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/>
>>>> | FOLLOW US <https://twitter.com/remindhq> | LIKE US
>>>> <https://www.facebook.com/remindhq>
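Putting rough numbers on the throughput question in the thread above: the following is a back-of-the-envelope sketch only. The 85 MiB/s figure is simply the midpoint of the reported 70-100 MiB/s range, and the calculation assumes a single instance pulls the whole 0.5 TiB savepoint, ignoring parallel downloads across TaskManagers.

```python
SAVEPOINT_MIB = 0.5 * 1024 * 1024  # 0.5 TiB expressed in MiB


def download_hours(throughput_mib_per_s: float) -> float:
    # Wall-clock hours to pull the whole savepoint at a given rate.
    return SAVEPOINT_MIB / throughput_mib_per_s / 3600


# At the observed rates the download alone approaches the 2 h timeout;
# at the expected GiB/s range it would take minutes.
for rate in (70, 85, 1024):
    print(f"{rate:>4} MiB/s -> {download_hours(rate):.2f} h")
# prints:
#   70 MiB/s -> 2.08 h
#   85 MiB/s -> 1.71 h
# 1024 MiB/s -> 0.14 h
```

If those numbers hold, the observed transfer rate by itself is enough to explain the timeout, independent of any restore-side work such as RocksDB reloading state.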