> > Once it's started and it can open the log, it won't crash, and it starts mesos-log recovery.
My memory is fuzzy here, but I was under the impression that holes in the log were filled before open() returned. Have you observed otherwise?

On Wed, Jun 22, 2016 at 8:02 AM, Martin Hrabovčin <martin.hrabov...@gmail.com> wrote:

> If there is some obvious issue with the replicated log, then the open() call would fail and cause Aurora to exit or restart itself. I am looking at a different issue: if there are 3 Aurora instances that need the update, it's hard to tell right now at which point it's safe to move from one instance to another. Let's say a rolling update is in progress, applying the update to one Aurora instance at a time. One instance is down and out of rotation. Once it's started and it can open the log, it won't crash, and it starts mesos-log recovery. But if you start upgrading the second instance before the mesos-log has been replicated to the first one, it's easy to lose quorum and data. I'd like to have a deterministic check that confirms it's safe to consider the log replicated.
>
> 2016-06-17 16:05 GMT+02:00 Bill Farner <wfar...@apache.org>:
>
> > If I recall correctly, the current implementation of the mesos log requires that the callers handle mutually exclusive access for reads and writes. This means that non-leading schedulers may not read or write to perform the check you describe.
> >
> > What's the behavior of the scheduler when it starts and the log replica is non-VOTING? I thought the log open() call would fail, and the scheduler process would exit (giving a strong signal that the scheduler is not healthy).
> >
> > On Fri, Jun 17, 2016 at 2:44 AM, Martin Hrabovčin <martin.hrabov...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I was asking the same question in the #aurora channel and I still haven't found an answer, so I am bringing this to the mailing list with a proposal.
> > >
> > > Is there a way to check the state of the mesos-log (whether it's writable, i.e. in VOTING state) through some HTTP check outside of the Aurora process on a non-leading Aurora instance? We are trying to create an external check that would determine whether the mesos-log is ready during an Aurora rolling update. When adding a new instance to an existing Aurora cluster, we want to make sure that the mesos-log is replicated and the replica is ready to serve reads and writes. Currently we’re grep-ing the Java process log and looking for “Persisted replica status to VOTING”.
> > >
> > > I was pointed to the /vars endpoint, but I haven't found an obvious answer there.
> > >
> > > I'd like to propose creating a new HTTP endpoint "/loghealth" that would, similarly to "/leaderhealth", return 200 when the mesos-log is ready and 503 when the mesos-log throws an exception. As for the implementation, I was thinking about doing a simple read from the log or writing a noop to the log directly.
> > >
> > > Thanks!
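For reference, the grep-based readiness check described in the thread can be turned into a small polling gate for a rolling update. This is a minimal sketch, assuming the scheduler writes its log to a plain-text file that the external check can read. Only the marker string comes from the thread; the function names `replica_is_voting` and `wait_for_voting`, and any log path you pass in, are hypothetical. Matching a log line is not the deterministic check asked for above: if the log file survives a restart, a marker left by a previous run would produce a false positive.

```python
import time
from pathlib import Path

# Marker quoted in the thread. ASSUMPTION: the scheduler log contains this line
# once the local mesos-log replica has reached VOTING.
VOTING_MARKER = "Persisted replica status to VOTING"


def replica_is_voting(log_path: Path) -> bool:
    """Return True if the (hypothetical) scheduler log file contains the VOTING marker."""
    try:
        with log_path.open("r", encoding="utf-8", errors="replace") as f:
            return any(VOTING_MARKER in line for line in f)
    except FileNotFoundError:
        # No log yet: the instance has not started, so it is not safe to proceed.
        return False


def wait_for_voting(log_path: Path, timeout_s: float = 300.0, poll_s: float = 5.0) -> bool:
    """Poll the log until the replica reports VOTING or the timeout expires.

    Returns True as soon as the marker is seen, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if replica_is_voting(log_path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In a rolling update, the tooling would call `wait_for_voting` against the log of the instance it just restarted and move on to the next scheduler only when it returns True; a `/loghealth` endpoint like the one proposed would let the same gate poll over HTTP instead of reading a file.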