> > How does the site get updated? Is it auto-generated when we build releases?
The source lives in the project SVN repo:

    $ svn co https://svn.apache.org/repos/asf/aurora/ aurora-svn

Here <https://svn.apache.org/repos/asf/aurora/site/README.md> are
instructions for updating it. It's a pretty mechanical process, but not
automated.

> So, do we plan to add the patch for next release?

Meghdoot - I suspect David would appreciate an incoming patch for the docs,
assuming that's what you're referring to. He mentioned this step
<https://github.com/apache/aurora/blob/34be631589ebf899e663b698dc76511eb1b9ad8a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L523-L530>
in the end-to-end tests, which is (hopefully) straightforward enough to try
without assistance.

On Tue, Jun 5, 2018 at 12:47 AM, meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:

> Thx David. So, do we plan to add the patch for next release? We will be
> happy to validate it as part of RC validation.
>
> From: David McLaughlin <dmclaugh...@apache.org>
> To: dev@aurora.apache.org
> Sent: Monday, June 4, 2018 9:45 AM
> Subject: Re: Recovery instructions updates
>
> We should definitely update that doc. Bill's patch makes this much easier
> (as can be seen by the e2e test), and we've been using it in our scale
> test environment. How does the site get updated? Is it auto-generated
> when we build releases?
>
> Having corrupted logs that frequently is concerning too; we haven't seen
> anything like this, and we do explicit snapshots/backups as part of every
> Scheduler deploy. If there's a bug lurking, it would be good to get in
> front of it.
>
> On Sun, Jun 3, 2018 at 10:40 AM, Meghdoot bhattacharya <
> meghdoo...@yahoo.com.invalid> wrote:
>
> > We will try to recover the log files on the snapshot loading error.
> >
> > +1 to Bill's approach of making recovery offline. We will try the
> > patch on our side.
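[For readers following along: the backup/restore step referenced in the e2e test above maps onto the aurora_admin recovery subcommands from the backup-restore doc. A minimal dry-run sketch; the cluster name "devcluster" and the backup ID are placeholder assumptions, and DRY_RUN=1 (the default here) only prints the plan instead of executing anything.]

```shell
#!/bin/sh
# Hedged sketch of the restore flow the e2e test exercises.
# DRY_RUN=1 records and prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}
PLAN=""
run() {
  PLAN="${PLAN}$* ; "
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run aurora_admin scheduler_list_backups devcluster      # pick a backup ID
run aurora_admin scheduler_stage_recovery devcluster scheduler-backup-2018-06-04
run aurora_admin scheduler_print_recovery_tasks devcluster   # sanity-check contents
run aurora_admin scheduler_commit_recovery devcluster   # or scheduler_unload_recovery to abort
```

Flip DRY_RUN=0 only after verifying the staged tasks look sane on your cluster.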
> >
> > Renan, I would ask you to prepare a PR for the restoration docs,
> > proposing the 2 additional steps required in the current world, as we
> > look to maybe use a different mechanism. The prep steps to get the
> > scheduler ready for backup can hopefully be eliminated with the
> > alternative approach.
> >
> > On the side, let's see if we can recover the logs of the corrupted
> > snapshot loading.
> >
> > Thx
> >
> > > On Jun 3, 2018, at 9:50 AM, Stephan Erb <stephan....@blue-yonder.com>
> > > wrote:
> > >
> > > That sounds indeed concerning. It would be great if you could file an
> > > issue and attach the related log files and tracebacks.
> > >
> > > Bill recently added a potential replacement for the existing restore
> > > mechanism: https://github.com/apache/aurora/commit/
> > > 2e1ca42887bc8ea1e8c6cddebe9d1cf29268c714. Given the set of issues you
> > > have bumped into with the current restore, this new approach might be
> > > worth exploring further.
> > >
> > > On 03.06.18, 08:43, "Meghdoot bhattacharya"
> > > <meghdoo...@yahoo.com.INVALID> wrote:
> > >
> > > Thx Renan for sharing the details. This backup restore happened under
> > > not-so-easy circumstances, so I would encourage the leads to keep the
> > > docs updated as much as possible and include them in release
> > > validation.
> > >
> > > The other issue, of snapshots having task and other objects as nil,
> > > which causes the schedulers to fail, we have now seen 2 times in the
> > > past year. Other than finding the root cause of why that entry
> > > happens during snapshot creation, there needs to be defensive code to
> > > either ignore that entry on loading, or a way to fix the snapshot,
> > > because we might have to go through a day's worth of snapshots to
> > > find which one did not have that entry and recover from there. Mean
> > > time to recover gets impacted under the circumstances.
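[Editorial aside: the "go through a day's worth of snapshots" search above could be scripted roughly as below. A hedged sketch only: the aurora_admin subcommands follow the backup-restore doc, but the cluster name, backup IDs, and the grep-for-null heuristic for spotting a nil entry are illustrative assumptions; DRY_RUN=1 only records the intended commands.]

```shell
#!/bin/sh
# Hedged sketch: stage candidate backups newest-first and commit the first
# one whose recovery tasks print without a suspicious nil/null entry.
DRY_RUN=${DRY_RUN:-1}
PLAN=""
run() {
  PLAN="${PLAN}$* ; "
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Backup IDs here are made up for illustration; list yours with
# `aurora_admin scheduler_list_backups devcluster`.
for backup in scheduler-backup-2018-06-03-23 scheduler-backup-2018-06-03-22; do
  run aurora_admin scheduler_stage_recovery devcluster "$backup"
  if run aurora_admin scheduler_print_recovery_tasks devcluster | grep -q null; then
    run aurora_admin scheduler_unload_recovery devcluster   # corrupted entry, try an older backup
  else
    run aurora_admin scheduler_commit_recovery devcluster   # first clean backup wins
    break
  fi
done
```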
> > > One extra piece of info, not sure if it is relevant or not: the
> > > corrupted snapshot got created by the admin CLI (the assumption being
> > > it should not matter whether the scheduler triggers it or it is
> > > forced via the CLI), which reported success, as did the Aurora logs,
> > > but then loading it exposed the issue.
> > >
> > > Thx
> > >
> > >> On Jun 2, 2018, at 3:54 PM, Renan DelValle <renanidelva...@gmail.com>
> > >> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> We tried following the recovery instructions from
> > >> http://aurora.apache.org/documentation/latest/operations/backup-restore/
> > >>
> > >> After our change from the Twitter commons ZK library to Apache
> > >> Curator, these instructions are no longer valid.
> > >>
> > >> In order for Aurora to carry out a leader election in the current
> > >> state, Aurora has to first connect to a Mesos master. What we ended
> > >> up doing was connecting to a Mesos master that had nothing on it, to
> > >> bypass this new requirement.
> > >>
> > >> Next, wiping away -native_log_file_path did not seem to be enough to
> > >> recover from a corrupted Mesos replicated log. We had to manually
> > >> wipe away entries in ZK and move the snapshot backup directory in
> > >> order for the leader not to fall back on either a snapshot or the
> > >> mesos-log to rehydrate itself.
> > >>
> > >> Finally, somehow triggering a manual snapshot generated a snapshot
> > >> with an invalid entry, which then caused the scheduler to fail after
> > >> a failover while trying to catch up on current state.
> > >>
> > >> We are trying to investigate why this took place (it could have been
> > >> that we didn't give the system enough time to finish hydrating the
> > >> snapshot), but the invalid entry, which looked something like a Task
> > >> with all null or 0 values, caused our leaders to fail (which
> > >> necessitated restoring from an earlier snapshot). Note that this was
> > >> only after we triggered the manual snapshot and BEFORE we tried to
> > >> restore.
> > >>
> > >> Will report more details as they become available, and will provide
> > >> some doc updates based on our experience.
> > >>
> > >> -Renan
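[Editorial aside: the extra steps Renan describes, beyond wiping -native_log_file_path, amount to clearing the ZK serverset node and moving the snapshot backup directory out of reach. A hedged dry-run sketch; all paths, the ZK node, the service name, and the zkCli invocation are assumptions for illustration, not the project's documented procedure, and DRY_RUN=1 only records the plan.]

```shell
#!/bin/sh
# Hedged sketch of a full manual reset of scheduler state, per the thread.
DRY_RUN=${DRY_RUN:-1}
PLAN=""
run() {
  PLAN="${PLAN}$* ; "
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run systemctl stop aurora-scheduler                  # on every scheduler host
run rm -rf /var/lib/aurora/scheduler/db              # the -native_log_file_path value (assumed)
run mesos-log initialize --path=/var/lib/aurora/scheduler/db   # re-create an empty replicated log
run zkCli.sh rmr /aurora/scheduler                   # drop stale serverset/leader entries (assumed node)
run mv /var/lib/aurora/backups /var/lib/aurora/backups.quarantine  # keep old snapshots out of reach
run systemctl start aurora-scheduler
```

On newer ZooKeeper releases `rmr` is replaced by `deleteall`; adjust to your deployment.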