Thx Bill. We will take it up.

On Jun 5, 2018, at 6:57 AM, Bill Farner <wfar...@apache.org> wrote:

>> How does the site get updated? Is it auto-generated when we build releases?
> 
> The source lives in the project SVN repo:
> 
> $ svn co https://svn.apache.org/repos/asf/aurora/ aurora-svn
> 
> Here are instructions for updating it.  It's a pretty mechanical process, but 
> not automated.
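> As a rough sketch of that mechanical process (hedged: the directory layout and commit message below are illustrative assumptions, and committing requires access to the SVN repo):

```shell
# Illustrative only: the site source is plain files in SVN, so an
# update amounts to check out, edit/regenerate, review, commit.
svn co https://svn.apache.org/repos/asf/aurora/ aurora-svn
cd aurora-svn
# ... edit or regenerate the affected pages here ...
svn status                              # review the changes
svn commit -m "Update backup-restore documentation"
```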
> 
>> So, do we plan to add the patch for next release?
> 
> Meghdoot - I suspect David would appreciate an incoming patch for the docs, 
> assuming that's what you're referring to.
> 
> He mentioned this step in the end-to-end tests, which is (hopefully) 
> straightforward enough to try without assistance.
> 
> 
> 
>> On Tue, Jun 5, 2018 at 12:47 AM, meghdoot bhattacharya 
>> <meghdoo...@yahoo.com.invalid> wrote:
>> Thx David. So, do we plan to add the patch for next release? We will be 
>> happy to validate it as part of rc validation.
>> 
>> 
>> 
>>       From: David McLaughlin <dmclaugh...@apache.org>
>>  To: dev@aurora.apache.org 
>>  Sent: Monday, June 4, 2018 9:45 AM
>>  Subject: Re: Recovery instructions updates
>>    
>> We should definitely update that doc; Bill's patch makes this much easier
>> (as can be seen in the e2e test), and we've been using it in our scale test
>> environment. How does the site get updated? Is it auto-generated when we
>> build releases?
>> 
>> Having corrupted logs that frequently is concerning too; we haven't seen
>> anything like this, and we do explicit snapshots/backups as part of every
>> Scheduler deploy. If there's a bug lurking, it would be good to get in front
>> of it.
>> 
>> On Sun, Jun 3, 2018 at 10:40 AM, Meghdoot bhattacharya <
>> meghdoo...@yahoo.com.invalid> wrote:
>> 
>> > We will try to recover the log files from the snapshot loading error.
>> >
>> > +1 to Bill's approach of making recovery offline. We will try the patch
>> > on our side.
>> >
>> > Renan, I would ask you to prepare a PR for the restoration docs proposing
>> > the 2 additional steps required in the current world, as we look at maybe
>> > using a different mechanism. The prep steps to get the scheduler ready for
>> > backup can hopefully be eliminated with the alternative approach.
>> >
>> > On the side, let's see if we can recover the logs from the corrupted
>> > snapshot loading.
>> >
>> >
>> > Thx
>> >
>> > > On Jun 3, 2018, at 9:50 AM, Stephan Erb <stephan....@blue-yonder.com> wrote:
>> > >
>> > > That sounds concerning indeed. It would be great if you could file an
>> > > issue and attach the related log files and tracebacks.
>> > >
>> > > Bill recently added a potential replacement for the existing restore
>> > > mechanism: https://github.com/apache/aurora/commit/2e1ca42887bc8ea1e8c6cddebe9d1cf29268c714.
>> > > Given the set of issues you have bumped into with the current restore,
>> > > this new approach might be worth exploring further.
>> > >
>> > > On 03.06.18, 08:43, "Meghdoot bhattacharya" <meghdoo...@yahoo.com.INVALID> wrote:
>> > >
>> > >    Thx Renan for sharing the details. This backup restore happened under
>> > >    not-so-easy circumstances, so I would encourage the leads to keep the
>> > >    docs as up to date as possible and to include them in release
>> > >    validation.
>> > >
>> > >    The other issue, snapshots having task and other objects as nil,
>> > >    which causes the schedulers to fail, we have now seen 2 times in the
>> > >    past year. Besides finding the root cause of why that entry is
>> > >    written during snapshot creation, there needs to be defensive code to
>> > >    either ignore that entry on loading, or a way to fix the snapshot.
>> > >    Otherwise we might have to go through a day's worth of snapshots to
>> > >    find which one did not have that entry and recover from there; mean
>> > >    time to recover gets impacted under those circumstances. One extra
>> > >    piece of info, not sure whether it is relevant: the corrupted
>> > >    snapshot was created via the admin CLI (our assumption is that it
>> > >    should not matter whether the scheduler triggers it or it is forced
>> > >    via the CLI), which reported success, as did the aurora logs, but
>> > >    then loading it exposed the issue.
>> > >
>> > >    Thx
>> > >
>> > >> On Jun 2, 2018, at 3:54 PM, Renan DelValle <renanidelva...@gmail.com> wrote:
>> > >>
>> > >> Hi all,
>> > >>
>> > >> We tried following the recovery instructions from
>> > >> http://aurora.apache.org/documentation/latest/operations/backup-restore/
>> > >>
>> > >> After our change from the Twitter commons ZK library to Apache Curator,
>> > >> these instructions are no longer valid.
>> > >>
>> > >> In order for Aurora to carry out a leader election in the current
>> > >> state, Aurora has to first connect to a Mesos master. What we ended up
>> > >> doing was connecting to a Mesos master that had nothing on it, to
>> > >> bypass this new requirement.
>> > >>
>> > >> Next, wiping away -native_log_file_path did not seem to be enough to
>> > >> recover from a corrupted Mesos replicated log. We had to manually wipe
>> > >> away entries in ZK and move the snapshot backup directory aside so that
>> > >> the leader would not fall back on either a snapshot or the mesos-log to
>> > >> rehydrate its state.
>> > >>
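>> > >> Concretely, the manual wipe above could look roughly like the following sketch (every path, the ZK connection string, and the /aurora/scheduler node are assumptions about a particular deployment, not Aurora defaults):

```shell
# Hedged sketch of the manual recovery wipe; adjust everything to your
# deployment, and stop the scheduler before touching its state.
systemctl stop aurora-scheduler

# 1. Remove the corrupted Mesos replicated log backing store,
#    i.e. whatever -native_log_file_path points at (path assumed):
rm -rf /var/lib/aurora/scheduler/native-log

# 2. Move the snapshot backup directory aside so the leader cannot
#    rehydrate from it (keep it for forensics, don't delete it):
mv /var/lib/aurora/backups /var/lib/aurora/backups.quarantine

# 3. Remove the scheduler's entries in ZooKeeper (node path assumed;
#    rmr is the recursive delete in zkCli):
zkCli.sh -server zk1:2181 rmr /aurora/scheduler
```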
>> > >> Finally, somehow triggering a manual snapshot generated a snapshot
>> > >> with an invalid entry, which then caused the scheduler to fail after a
>> > >> failover while trying to catch up on current state.
>> > >>
>> > >> We are trying to investigate why this took place (it could be that we
>> > >> didn't give the system enough time to finish hydrating the snapshot).
>> > >> The invalid entry, which looked something like a Task with all null or
>> > >> 0 values, caused our leaders to fail, which necessitated restoring
>> > >> from an earlier snapshot. Note that this was only after we triggered
>> > >> the manual snapshot and BEFORE we tried to restore.
>> > >>
>> > >> Will report more details as they become available, and will provide
>> > >> some doc updates based on our experience.
>> > >>
>> > >> -Renan
>> > >
>> > >
>> > >
>> >
>> >
>> 
>>    
> 
