Thx Bill. We will take it up.

On Jun 5, 2018, at 6:57 AM, Bill Farner <wfar...@apache.org> wrote:
>> How does the site get updated? Is it auto-generated when we build releases?
>
> The source lives in the project SVN repo:
>
> $ svn co https://svn.apache.org/repos/asf/aurora/ aurora-svn
>
> Here are instructions for updating it. It's a pretty mechanical process, but
> not automated.
>
>> So, do we plan to add the patch for next release?
>
> Meghdoot - I suspect David would appreciate an incoming patch for the docs,
> assuming that's what you're referring to.
>
> He mentioned this step in the end-to-end tests, which is (hopefully)
> straightforward enough to try without assistance.
>
>> On Tue, Jun 5, 2018 at 12:47 AM, meghdoot bhattacharya
>> <meghdoo...@yahoo.com.invalid> wrote:
>> Thx David. So, do we plan to add the patch for next release? We will be
>> happy to validate it as part of rc validation.
>>
>> From: David McLaughlin <dmclaugh...@apache.org>
>> To: dev@aurora.apache.org
>> Sent: Monday, June 4, 2018 9:45 AM
>> Subject: Re: Recovery instructions updates
>>
>> We should definitely update that doc, Bill's patch makes this much easier
>> (as can be seen by the e2e test) and we've been using it in our scale test
>> environment. How does the site get updated? Is it auto-generated when we
>> build releases?
>>
>> Having corrupted logs that frequently is concerning too, we haven't seen
>> anything like this and we do explicit snapshots/backups as part of every
>> Scheduler deploy. If there's a bug lurking, would be good to get in front
>> of it.
>>
>> On Sun, Jun 3, 2018 at 10:40 AM, Meghdoot bhattacharya <
>> meghdoo...@yahoo.com.invalid> wrote:
>>
>> > We will try to recover the log files on the snapshot loading error.
>> >
>> > + 1 to Bill’s approach on making offline recovery. We will try the patch
>> > on our side.
>> >
>> > Renan, I would ask you to prepare a PR for the restoration docs proposing
>> > the 2 additional steps required in current world as we look to maybe using
>> > a different mechanism.
>> > The prep steps to get the scheduler ready for backup can
>> > hopefully be eliminated with the alternative approach.
>> >
>> > On the side, let's see if we can recover the logs of the corrupted snapshot
>> > loading.
>> >
>> > Thx
>> >
>> > > On Jun 3, 2018, at 9:50 AM, Stephan Erb <stephan....@blue-yonder.com> wrote:
>> > >
>> > > That sounds indeed concerning. Would be great if you could file an issue
>> > > and attach the related log files and tracebacks.
>> > >
>> > > Bill recently added a potential replacement for the existing restore
>> > > mechanism:
>> > > https://github.com/apache/aurora/commit/2e1ca42887bc8ea1e8c6cddebe9d1cf29268c714.
>> > > Given the set of issues you have bumped into with the current restore,
>> > > this new approach might be worth exploring further.
>> > >
>> > > On 03.06.18, 08:43, "Meghdoot bhattacharya"
>> > > <meghdoo...@yahoo.com.INVALID> wrote:
>> > >
>> > > Thx Renan for sharing the details. This backup restore happened under
>> > > not so easy circumstances, so we would encourage the leads to keep the
>> > > docs updated as much as possible and include them in release validation.
>> > >
>> > > The other issue, snapshots containing task and other objects as nil,
>> > > which causes the schedulers to fail, we have now seen 2 times in the past
>> > > year. Other than finding the root cause of why that entry appears during
>> > > snapshot creation, there needs to be defensive code, either to ignore that
>> > > entry on loading or to fix the snapshot. Otherwise we might have to go
>> > > through a day's worth of snapshots to find one that did not have that
>> > > entry and recover from there. Mean time to recover gets impacted under the
>> > > circumstances. One extra piece of info, which may or may not be relevant:
>> > > the corrupted snapshot was created by the admin CLI (our assumption is
>> > > that it should not matter whether the scheduler triggers it or it is
>> > > forced via the CLI). Both the CLI and the Aurora logs showed success, but
>> > > loading the snapshot then exposed the issue.
>> > >
>> > > Thx
>> > >
>> > >> On Jun 2, 2018, at 3:54 PM, Renan DelValle <renanidelva...@gmail.com> wrote:
>> > >>
>> > >> Hi all,
>> > >>
>> > >> We tried following the recovery instructions from
>> > >> http://aurora.apache.org/documentation/latest/operations/backup-restore/
>> > >>
>> > >> After our change from the Twitter commons ZK library to Apache Curator,
>> > >> these instructions are no longer valid.
>> > >>
>> > >> In order for Aurora to carry out a leader election in the current state,
>> > >> Aurora has to first connect to a Mesos master. What we ended up doing was
>> > >> connecting to a Mesos master that had nothing on it to bypass this new
>> > >> requirement.
>> > >>
>> > >> Next, wiping away -native_log_file_path did not seem to be enough to
>> > >> recover from a corrupted Mesos replicated log. We had to manually wipe away
>> > >> entries in ZK and move the snapshot backup directory in order for the
>> > >> leader to not fall back on either a snapshot or the mesos-log to rehydrate
>> > >> the leader.
>> > >>
>> > >> Finally, somehow triggering a manual snapshot generated a snapshot with an
>> > >> invalid entry which then caused the scheduler to fail after a failover
>> > >> while trying to catch up on current state.
>> > >>
>> > >> We are trying to investigate why this took place (it could have been we
>> > >> didn't give the system enough time to finish hydrating the snapshot), but
>> > >> the invalid entry, which looked something like a Task with all null or 0
>> > >> values, caused our leaders to fail (which necessitated restoring from an
>> > >> earlier snapshot). Note that this was only after we triggered the manual
>> > >> snapshot and BEFORE we tried to restore.
>> > >>
>> > >> Will report more details as they become available and will provide some doc
>> > >> updates based on our experience.
>> > >>
>> > >> -Renan
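
Two ideas come up in the thread above: ignoring a task entry whose fields are all null or 0 when loading a snapshot, and walking back through backups to find the newest one without such an entry. Below is a rough sketch of both. It is not Aurora code and does not use Aurora's snapshot format or APIs. Plain Python dicts stand in for deserialized task entries, and every name in it (`is_blank`, `drop_blank_tasks`, `newest_clean_snapshot`, the field names) is hypothetical.

```python
from typing import Any, List, Mapping, Optional, Sequence, Tuple

Task = Mapping[str, Any]  # hypothetical stand-in for a deserialized task entry


def is_blank(value: Any) -> bool:
    """Return True for None, False, 0, empty strings, and containers holding only blank values."""
    if value is None:
        return True
    if isinstance(value, bool):
        return not value
    if isinstance(value, (int, float)):
        return value == 0
    if isinstance(value, (str, bytes)):
        return len(value) == 0
    if isinstance(value, Mapping):
        return all(is_blank(v) for v in value.values())
    if isinstance(value, (list, tuple, set, frozenset)):
        return all(is_blank(v) for v in value)
    return False


def drop_blank_tasks(tasks: Sequence[Task]) -> Tuple[List[Task], int]:
    """Keep the tasks that carry at least one real value; also report how many were dropped."""
    kept = [t for t in tasks if not is_blank(t)]
    return kept, len(tasks) - len(kept)


def newest_clean_snapshot(snapshots: Sequence[Tuple[str, Sequence[Task]]]) -> Optional[str]:
    """Given (name, tasks) pairs ordered oldest to newest, return the newest name with no blank task."""
    for name, tasks in reversed(snapshots):
        if not any(is_blank(t) for t in tasks):
            return name
    return None
```

Whether silently dropping an entry is safe would depend on how the invalid entry got there in the first place, so a real fix would probably log and count dropped entries rather than hide them.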