Re: Optimizing for super long checkpoint times

Steven Wu Sat, 12 Dec 2020 11:16:58 -0800

> things are actually moving pretty smoothly

Do you mean the job is otherwise healthy? like there is no lag etc.


Do you see any bottleneck at system level, like CPU, network, disk I/O etc.?

On Sat, Dec 12, 2020 at 10:54 AM Rex Fenley <r...@remind101.com> wrote:

> Hi,
>
> We're running a job with on the order of >100GiB of state. For our initial
> run we wanted to keep things simple, so we allocated a single core node
> with 1 Taskmanager and 1 parallelism and 1 TiB storage (split between 4
> disks on that machine). Overall, things are actually moving pretty
> smoothly, except for checkpointing. Checkpoints are set to be incremental,
> yet they're all in the range of 10-20 GiB -- we do have a lot of data being
> updated in real-time, retracts+appends -- and they take around 10-30 min.
> We have the Taskmanager to set to checkpoint every 5 min which means we're
> spending the majority of our time just checkpointing.
>
> My question is, what sort of bottlenecks should we be investigating and
> what are some things we can try to improve our checkpoint times?
>
> Some things we're considering are:
> Increasing parallelism, hoping that this will partition the data and each
> operator will therefore checkpoint faster.
> Changing time between checkpoints, though we don't have a good
> understanding of how this might affect total time.
>
> Also, we are hesitant to use unaligned checkpointing at the moment and are
> hoping for some other options.
>
> Thanks!
>
> --
>
> Rex Fenley  |  Software Engineer - Mobile and Backend
>
>
> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>  |
>  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
> <https://www.facebook.com/remindhq>
>

Re: Optimizing for super long checkpoint times

Reply via email to