Thx Renan for sharing the details. This backup restore happened under 
difficult circumstances, so I would encourage the leads to keep the docs 
updated as much as possible and to include them in release validation.

The other issue, snapshots with a Task and other objects as nil, which causes 
the schedulers to fail, we have now seen twice in the past year. Beyond 
finding the root cause of how that entry gets written during snapshot 
creation, there needs to be defensive code that either ignores the entry on 
loading or offers a way to repair the snapshot. Otherwise we might have to go 
through a day's worth of snapshots to find which one does not have that entry 
and recover from there, and mean time to recover suffers under those 
circumstances. One extra piece of info, not sure whether it is relevant: the 
corrupted snapshot was created via the admin CLI (my assumption is that it 
should not matter whether the scheduler triggers it or it is forced via the 
CLI); both the CLI and the Aurora logs reported success, but loading the 
snapshot exposed the issue.

Thx

> On Jun 2, 2018, at 3:54 PM, Renan DelValle <renanidelva...@gmail.com> wrote:
> 
> Hi all,
> 
> We tried following the recovery instructions from
> http://aurora.apache.org/documentation/latest/operations/backup-restore/
> 
> After our change from the Twitter commons ZK library to Apache Curator,
> these instructions are no longer valid.
> 
> In order for Aurora to carry out a leader election in the current state,
> Aurora has to first connect to a Mesos master. What we ended up doing was
> connecting to a Mesos master that had nothing on it to bypass this new
> requirement.
> 
> Next, wiping away -native_log_file_path did not seem to be enough to
> recover from a corrupted Mesos replicated log. We had to manually wipe away
> entries in ZK and move the snapshot backup directory so that the leader
> would not fall back on either a snapshot or the mesos-log to rehydrate its
> state.
> 
> Finally, triggering a manual snapshot somehow generated a snapshot with an
> invalid entry, which then caused the scheduler to fail after a failover
> while trying to catch up on current state.
> 
> We are still investigating why this took place (it could be that we did not
> give the system enough time to finish hydrating the snapshot). The invalid
> entry, which looked like a Task with all null or 0 values, caused our
> leaders to fail and necessitated restoring from an earlier snapshot. Note
> that this happened only after we triggered the manual snapshot and BEFORE
> we tried to restore.
> 
> Will report more details as they become available and will provide some doc
> updates based on our experience.
> 
> -Renan
