Do you have advice on how to determine why a checkpoint failed?

1. Timeout (that's easy to discover, as the UI logs them).
2. Other errors are not so easy to find.

How can I find other errors? Are they in the UI, or in good old-fashioned logging?

On Fri, Jan 29, 2021 at 3:11 AM Congxian Qiu <qcx978132...@gmail.com> wrote:

> Hi Marco
>
> You need to figure out why the checkpoint timed out (you can see the
> time consumed by each phase of a checkpoint in the UI). If it really
> needs that long to complete the checkpoint, then you need to configure a
> longer timeout.
>
> If there are checkpoint errors, we first need to figure out what the
> problem is. In general, a checkpoint can be split into parts such as
> barrier alignment (maybe there is backpressure or something else, so that
> some barriers can't be received in time), sync duration (the thread is too
> busy ...), async duration (too much IO/network processing ...), etc.
>
> Best,
> Congxian
>
> On Fri, Jan 29, 2021 at 7:19 AM Marco Villalobos <mvillalo...@kineteque.com> wrote:
>
>> I am kind of stuck in determining how large a checkpoint interval should
>> be.
>>
>> Is there a guide for that? If the timeout is 10 minutes and we time out,
>> what is a good strategy for adjusting that?
>>
>> Where is a good starting point for a checkpoint? How should they be
>> adjusted?
>>
>> We often see checkpoint errors during our onTimer calls; I don't know if
>> that's related.
>>
>> Marco A. Villalobos
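
The breakdown Congxian describes (barrier alignment, sync duration, async duration) can be turned into a quick triage step. The following is only a minimal sketch in plain Python, not anything provided by Flink: the class, the function names, and the hint texts are hypothetical, and it assumes you copy the per-subtask durations by hand from the checkpoint details in the UI.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubtaskCheckpointTimes:
    """Per-subtask durations in milliseconds, copied by hand from the UI (hypothetical type)."""
    name: str
    alignment_ms: int
    sync_ms: int
    async_ms: int


# Hypothetical hints that paraphrase the possible causes named in the thread.
HINTS = {
    "alignment": "barriers arrive late; look for backpressure",
    "sync": "the task thread is too busy during the synchronous part",
    "async": "too much IO/network work in the asynchronous part",
}


def dominant_phase(times: SubtaskCheckpointTimes) -> str:
    """Return the name of the phase that took the longest for this subtask."""
    phases = {
        "alignment": times.alignment_ms,
        "sync": times.sync_ms,
        "async": times.async_ms,
    }
    return max(phases, key=phases.get)


def diagnose(subtasks: List[SubtaskCheckpointTimes]) -> List[str]:
    """Build one report line per subtask, slowest subtask first."""
    ordered = sorted(
        subtasks,
        key=lambda t: t.alignment_ms + t.sync_ms + t.async_ms,
        reverse=True,
    )
    return [
        f"{t.name}: {dominant_phase(t)} dominates ({HINTS[dominant_phase(t)]})"
        for t in ordered
    ]
```

Called with made-up numbers, for example `diagnose([SubtaskCheckpointTimes("window-op", 540000, 200, 3000)])`, it reports which phase to look at first. If one phase really does need that long, raising the timeout is the other option mentioned above.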