[jira] [Comment Edited] (AURORA-1605) Update recovery docs to reflect changes

2016-02-04 Thread Maxim Khutornenko (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124606#comment-15124606
 ] 

Maxim Khutornenko edited comment on AURORA-1605 at 2/4/16 5:07 PM:
---

Just dumping it here before I forget:
* mesos master auth had to be turned off (need to figure out why that was 
necessary)
* with kerberos auth ON: local admin run required modifying the hosts file on a 
leader to alias localhost to the URL used in keytabs
* leader redirect has to be OFF (AURORA-1601)
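
For the hosts-file point, here is a minimal sketch of a check one could run on the leader before a local admin run. The helper name {{has_loopback_alias}} is a hypothetical illustration and not part of Aurora; it only tests whether a hosts file maps 127.0.0.1 to the hostname used in the keytab.

```shell
# Hypothetical helper, not part of Aurora.
# $1: hosts file, $2: hostname the keytab principal was issued for.
# Returns 0 if some 127.0.0.1 entry lists that hostname, 1 otherwise.
has_loopback_alias() {
  awk -v name="$2" '
    { sub(/#.*/, "") }   # drop trailing comments before matching
    $1 == "127.0.0.1" {
      for (i = 2; i <= NF; i++) if ($i == name) found = 1
    }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

Usage would look like {{has_loopback_alias /etc/hosts <keytab-hostname> && echo "alias present"}}, where the hostname is whatever the HTTP principal was created for.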

Also: update thrift deprecation guidelines with dual backfill instructions to 
explicitly account for version rollback scenarios.


was (Author: maximk):
Just dumping it here before I forget:
* mesos master auth had to be turned off (need to figure out why that was 
necessary)
* with kerberos auth ON: local admin run required modifying the hosts file on a 
leader to alias localhost to the URL used in keytabs
* leader redirect has to be OFF (AURORA-1601)


> Update recovery docs to reflect changes
> ---
>
> Key: AURORA-1605
> URL: https://issues.apache.org/jira/browse/AURORA-1605
> Project: Aurora
> Issue Type: Task
> Components: Documentation
> Reporter: Joshua Cohen
> Priority: Minor
>
> We had to restore one of our clusters from backup recently, and it turns out 
> there's been some drift between the [documented 
> process](https://github.com/apache/aurora/blob/f630bf705ac8a9de2b7b987858ada3b876f65abf/docs/storage-config.md#recovering-from-a-scheduler-backup)
>  and what's currently necessary.
> Specifically, we needed to disable the leader redirect filter and, I believe, 
> mesos authentication.
> We should make sure the recovery docs are up to date with what's actually 
> required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1605) Update recovery docs to reflect changes

2016-02-03 Thread John Sirois (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130751#comment-15130751
 ] 

John Sirois edited comment on AURORA-1605 at 2/3/16 6:00 PM:
-

I went through the docs using test_kerberos_end_to_end.sh and hit a few 
roadblocks / things that do not jibe with the description in this ticket.  I'm 
sure I'm missing obvious things, but if not, my experience is detailed below.

h5. Set up environment to test recovery
# I edited test_kerberos_end_to_end.sh to skip tear-down and then ran it to 
set up the kerberized scheduler
# I ssh'd to vagrant and ran the steps through setup manually to get kinit'd as 
root for the aurora_admin commands I'd need to run, i.e. roughly:
{noformat}
cd ~/krb5-1.13.1/build
make testrealm
SCHEDULER_HOSTNAME=aurora.local
kadmin.local -q "addprinc -randkey HTTP/$SCHEDULER_HOSTNAME"
rm -f testdir/HTTP-$SCHEDULER_HOSTNAME.keytab
kadmin.local -q "ktadd -keytab testdir/HTTP-$SCHEDULER_HOSTNAME.keytab HTTP/$SCHEDULER_HOSTNAME"
kadmin.local -q "addprinc -randkey root"
rm -f testdir/root.keytab
kadmin.local -q "ktadd -keytab testdir/root.keytab root"
kinit -k -t "testdir/root.keytab" root
{noformat}
# {{aurora_admin scheduler_backup_now devcluster && aurora_admin scheduler_list_backups devcluster}}

h5. Do a restore

I ran through the restore docs as detailed below:

h6. Preparation

{noformat}
$ diff /etc/init/aurora-scheduler-kerberos.conf /etc/init/aurora-scheduler-kerberos.pre-recovery.conf
42,44c42
<   -mesos_master_address=zk://localhost:181/mesos/master \
<   -max_registration_delay=365days \
<   -reconciliation_initial_delay=365days \
---
>   -mesos_master_address=zk://localhost:2181/mesos/master \
{noformat}
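
The edit shown in that diff can be sketched as a small filter. This is an illustration under assumptions: {{make_recovery_conf}} is a hypothetical helper, and it assumes the {{-mesos_master_address}} flag sits on its own backslash-continued line, as in the upstart config above. It writes the recovery variant to stdout and leaves the original file untouched.

```shell
# Hypothetical helper, not part of Aurora.
# $1: path to the normal scheduler config
# $2: deliberately unreachable ZooKeeper URL for the Mesos master
# Prints a recovery variant of the config on stdout.
make_recovery_conf() {
  awk -v bad="$2" '
    /-mesos_master_address=/ {
      # Reuse the indentation of the original flag line.
      match($0, /^[ \t]*/)
      indent = substr($0, 1, RLENGTH)
      print indent "-mesos_master_address=" bad " \\"
      print indent "-max_registration_delay=365days \\"
      print indent "-reconciliation_initial_delay=365days \\"
      next
    }
    { print }
  ' "$1"
}
```

Writing to stdout makes it easy to diff the result against the live config before swapping it in, and to keep the pre-recovery copy around for the rollback step.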

h6. Restore from backup

The leading scheduler could only be identified via logs:
{noformat}
sudo grep "Elected as leading scheduler" /var/log/upstart/aurora-scheduler-kerberos.log | tail -1
I0203 16:57:05.336 [main, SchedulerLifecycle$5:238] Elected as leading scheduler!
{noformat}
or by examining the ZooKeeper nodes:
{noformat}
/usr/share/zookeeper/bin/zkCli.sh ls /aurora/scheduler
...
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[singleton_candidate_27]
/usr/share/zookeeper/bin/zkCli.sh get /aurora/scheduler/singleton_candidate_27
...
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
127.0.1.1
cZxid = 0x17b
ctime = Wed Feb 03 17:12:43 UTC 2016
mZxid = 0x17b
mtime = Wed Feb 03 17:12:43 UTC 2016
pZxid = 0x17b
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x152a7d138160090
dataLength = 9
numChildren = 0
{noformat}
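
With more than one scheduler running, the {{ls}} output would list several candidate nodes. A rough sketch for picking one out of such a listing follows. It assumes the usual ZooKeeper leader-election recipe, in which the candidate with the lowest sequence suffix is the leader; I have not verified that against the scheduler code, and {{lowest_candidate}} is a hypothetical helper name.

```shell
# Hypothetical helper, not part of Aurora or ZooKeeper.
# $1: a zkCli-style child listing, e.g.
#     "[singleton_candidate_30, singleton_candidate_27]"
# Prints the child with the numerically lowest sequence suffix.
lowest_candidate() {
  printf '%s\n' "$1" \
    | sed 's/[][ ]//g' \
    | tr ',' '\n' \
    | awk -F_ 'NF { print $NF, $0 }' \
    | sort -n \
    | head -n 1 \
    | cut -d' ' -f2
}
```

The printed node name can then be passed to {{zkCli.sh get /aurora/scheduler/<node>}} to read the address that candidate registered.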
All aurora_admin commands (e.g. {{aurora_admin get_scheduler}}, 
{{aurora_admin scheduler_list_backups}}) fail at this point, though, with output 
of this flavor:
{noformat}
aurora_admin scheduler_stage_recovery -v --bypass-leader-redirect devcluster scheduler-backup-2016-02-03-16-32
DEBUG] Using auth module: 

 INFO] Connecting to 192.168.33.7:2181
 INFO] Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=0, time_out=1, session_id=0, passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', read_only=None)
 INFO] Zookeeper connection established, state: CONNECTED
 INFO] Sending request(xid=1): GetChildren(path=u'/aurora/scheduler', watcher=)
 INFO] Received response(xid=1): [u'singleton_candidate_22']
 INFO] Sending request(xid=2): GetChildren(path=u'/aurora/scheduler', watcher=None)
 INFO] Received response(xid=2): [u'singleton_candidate_22']
 WARN] Could not connect to scheduler: No schedulers detected in devcluster!
{noformat}

As a result, the only way to complete the rest of the guide was to re-edit 
{{/etc/init/aurora-scheduler-kerberos.conf}} and restore the correct 
{{-mesos_master_address}}.  After doing this and bouncing the scheduler I could 
run aurora_admin commands and successfully complete the restore via the rest of 
the guide.

So it seems to me the guide needs to suggest, at a high level:
# All schedulers are stopped (say 5 of them).
# All but one scheduler (4 in this example) are prepared as in "Preparation", 
but 1 scheduler is prepared as in "Preparation" except for the bit about 
setting an invalid {{-mesos_master_address}} and with the addition of 
emphasizing the bit about port-blocking to prevent user-activity.  This special 
scheduler will be used to run the recovery staging, review and commit.
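
The split proposed in the two steps above can be sketched as a tiny planning helper. This only prints who gets which preparation; it changes nothing on any host, and {{plan_recovery}} plus the role labels are hypothetical names used for illustration.

```shell
# Hypothetical helper, not part of Aurora.
# $1: scheduler chosen to run recovery staging, review and commit
# $2..: all remaining schedulers
# Prints one tab-separated line per host: host, role, preparation.
plan_recovery() {
  target="$1"
  shift
  printf '%s\trecovery-target\tkeep valid -mesos_master_address; block user-facing ports\n' "$target"
  for host in "$@"; do
    printf '%s\tstandby\tset invalid -mesos_master_address\n' "$host"
  done
}
```

For the five-scheduler example, calling it with five host names would yield one recovery-target line and four standby lines.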

If I have this approximately right, I concur with [~StephanErb]'s second 
comment above - the 1st "Identify the leading scheduler by" will then always 
work, i.e. {{aurora_admin get_scheduler}} - but it's beside the point since the 
preparation already singled out a leader to run the recovery against.

This leads me to think the purpose of the "Identify the leading scheduler by" 
section is to find the 
