> > How does the site get updated? Is it auto-generated when we build releases?
The source lives in the project SVN repo:

    $ svn co https://svn.apache.org/repos/asf/aurora/ aurora-svn

Here <https://svn.apache.org/repos/asf/aurora/site/README.md> are
instructions for updating it. It's a pretty mechanical process, but not
automated.

> So, do we plan to add the patch for next release?

Meghdoot - I suspect David would appreciate an incoming patch for the docs,
assuming that's what you're referring to. He mentioned this step
<https://github.com/apache/aurora/blob/34be631589ebf899e663b698dc76511eb1b9ad8a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L523-L530>
in the end-to-end tests, which is (hopefully) straightforward enough to try
without assistance.

On Tue, Jun 5, 2018 at 12:47 AM, meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:

> Thx David. So, do we plan to add the patch for next release? We will be
> happy to validate it as part of RC validation.
>
> From: David McLaughlin <dmclaugh...@apache.org>
> To: dev@aurora.apache.org
> Sent: Monday, June 4, 2018 9:45 AM
> Subject: Re: Recovery instructions updates
>
> We should definitely update that doc. Bill's patch makes this much easier
> (as can be seen by the e2e test), and we've been using it in our scale
> test environment. How does the site get updated? Is it auto-generated
> when we build releases?
>
> Having corrupted logs that frequently is concerning too; we haven't seen
> anything like this, and we do explicit snapshots/backups as part of every
> Scheduler deploy. If there's a bug lurking, it would be good to get in
> front of it.
>
> On Sun, Jun 3, 2018 at 10:40 AM, Meghdoot bhattacharya <
> meghdoo...@yahoo.com.invalid> wrote:
>
> > We will try to recover the log files on the snapshot loading error.
> >
> > +1 to Bill's approach of making recovery offline. We will try the
> > patch on our side.
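[For readers following along: the backup/restore step referenced in the e2e test above maps onto the aurora_admin recovery subcommands from the backup-restore doc. A minimal dry-run sketch; the cluster name "devcluster" and the backup ID are placeholder assumptions, and DRY_RUN=1 (the default here) only prints the plan instead of executing anything.]

```shell
#!/bin/sh
# Hedged sketch of the restore flow the e2e test exercises.
# DRY_RUN=1 records and prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}
PLAN=""
run() {
  PLAN="${PLAN}$* ; "
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run aurora_admin scheduler_list_backups devcluster      # pick a backup ID
run aurora_admin scheduler_stage_recovery devcluster scheduler-backup-2018-06-04
run aurora_admin scheduler_print_recovery_tasks devcluster   # sanity-check contents
run aurora_admin scheduler_commit_recovery devcluster   # or scheduler_unload_recovery to abort
```

Flip DRY_RUN=0 only after verifying the staged tasks look sane on your cluster.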
> >
> > Renan, I would ask you to prepare a PR for the restoration docs,
> > proposing the 2 additional steps required in the current world, as we
> > look to maybe use a different mechanism. The prep steps to get the
> > scheduler ready for backup can hopefully be eliminated with the
> > alternative approach.
> >
> > On the side, let's see if we can recover the logs of the corrupted
> > snapshot loading.
> >
> > Thx
> >
> > > On Jun 3, 2018, at 9:50 AM, Stephan Erb <stephan....@blue-yonder.com>
> > > wrote:
> > >
> > > That sounds indeed concerning. It would be great if you could file an
> > > issue and attach the related log files and tracebacks.
> > >
> > > Bill recently added a potential replacement for the existing restore
> > > mechanism: https://github.com/apache/aurora/commit/
> > > 2e1ca42887bc8ea1e8c6cddebe9d1cf29268c714. Given the set of issues you
> > > have bumped into with the current restore, this new approach might be
> > > worth exploring further.
> > >
> > > On 03.06.18, 08:43, "Meghdoot bhattacharya"
> > > <meghdoo...@yahoo.com.INVALID> wrote:
> > >
> > > Thx Renan for sharing the details. This backup restore happened under
> > > not-so-easy circumstances, so I would encourage the leads to keep the
> > > docs updated as much as possible and include them in release
> > > validation.
> > >
> > > The other issue, of snapshots having task and other objects as nil,
> > > which causes the schedulers to fail, we have now seen 2 times in the
> > > past year. Other than finding the root cause of why that entry
> > > happens during snapshot creation, there needs to be defensive code to
> > > either ignore that entry on loading, or a way to fix the snapshot,
> > > because we might have to go through a day's worth of snapshots to
> > > find which one did not have that entry and recover from there. Mean
> > > time to recover gets impacted under the circumstances.
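[Editorial aside: the "go through a day's worth of snapshots" search above could be scripted roughly as below. A hedged sketch only: the aurora_admin subcommands follow the backup-restore doc, but the cluster name, backup IDs, and the grep-for-null heuristic for spotting a nil entry are illustrative assumptions; DRY_RUN=1 only records the intended commands.]

```shell
#!/bin/sh
# Hedged sketch: stage candidate backups newest-first and commit the first
# one whose recovery tasks print without a suspicious nil/null entry.
DRY_RUN=${DRY_RUN:-1}
PLAN=""
run() {
  PLAN="${PLAN}$* ; "
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Backup IDs here are made up for illustration; list yours with
# `aurora_admin scheduler_list_backups devcluster`.
for backup in scheduler-backup-2018-06-03-23 scheduler-backup-2018-06-03-22; do
  run aurora_admin scheduler_stage_recovery devcluster "$backup"
  if run aurora_admin scheduler_print_recovery_tasks devcluster | grep -q null; then
    run aurora_admin scheduler_unload_recovery devcluster   # corrupted entry, try an older backup
  else
    run aurora_admin scheduler_commit_recovery devcluster   # first clean backup wins
    break
  fi
done
```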
> > > One extra piece of info, not sure if it is relevant or not: the
> > > corrupted snapshot got created by the admin CLI (the assumption being
> > > it should not matter whether the scheduler triggers it or it is
> > > forced via the CLI), which reported success, as did the Aurora logs,
> > > but then loading it exposed the issue.
> > >
> > > Thx
> > >
> > >> On Jun 2, 2018, at 3:54 PM, Renan DelValle <renanidelva...@gmail.com>
> > >> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> We tried following the recovery instructions from
> > >> http://aurora.apache.org/documentation/latest/operations/backup-restore/
> > >>
> > >> After our change from the Twitter commons ZK library to Apache
> > >> Curator, these instructions are no longer valid.
> > >>
> > >> In order for Aurora to carry out a leader election in the current
> > >> state, Aurora has to first connect to a Mesos master. What we ended
> > >> up doing was connecting to a Mesos master that had nothing on it, to
> > >> bypass this new requirement.
> > >>
> > >> Next, wiping away -native_log_file_path did not seem to be enough to
> > >> recover from a corrupted Mesos replicated log. We had to manually
> > >> wipe away entries in ZK and move the snapshot backup directory in
> > >> order for the leader not to fall back on either a snapshot or the
> > >> mesos-log to rehydrate itself.
> > >>
> > >> Finally, somehow triggering a manual snapshot generated a snapshot
> > >> with an invalid entry, which then caused the scheduler to fail after
> > >> a failover while trying to catch up on current state.
> > >>
> > >> We are trying to investigate why this took place (it could have been
> > >> that we didn't give the system enough time to finish hydrating the
> > >> snapshot), but the invalid entry, which looked something like a Task
> > >> with all null or 0 values, caused our leaders to fail (which
> > >> necessitated restoring from an earlier snapshot). Note that this was
> > >> only after we triggered the manual snapshot and BEFORE we tried to
> > >> restore.
> > >>
> > >> Will report more details as they become available, and will provide
> > >> some doc updates based on our experience.
> > >>
> > >> -Renan
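[Editorial aside: the extra steps Renan describes, beyond wiping -native_log_file_path, amount to clearing the ZK serverset node and moving the snapshot backup directory out of reach. A hedged dry-run sketch; all paths, the ZK node, the service name, and the zkCli invocation are assumptions for illustration, not the project's documented procedure, and DRY_RUN=1 only records the plan.]

```shell
#!/bin/sh
# Hedged sketch of a full manual reset of scheduler state, per the thread.
DRY_RUN=${DRY_RUN:-1}
PLAN=""
run() {
  PLAN="${PLAN}$* ; "
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run systemctl stop aurora-scheduler                  # on every scheduler host
run rm -rf /var/lib/aurora/scheduler/db              # the -native_log_file_path value (assumed)
run mesos-log initialize --path=/var/lib/aurora/scheduler/db   # re-create an empty replicated log
run zkCli.sh rmr /aurora/scheduler                   # drop stale serverset/leader entries (assumed node)
run mv /var/lib/aurora/backups /var/lib/aurora/backups.quarantine  # keep old snapshots out of reach
run systemctl start aurora-scheduler
```

On newer ZooKeeper releases `rmr` is replaced by `deleteall`; adjust to your deployment.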