[jira] [Comment Edited] (AURORA-1605) Update recovery docs to reflect changes
[ https://issues.apache.org/jira/browse/AURORA-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124606#comment-15124606 ]

Maxim Khutornenko edited comment on AURORA-1605 at 2/4/16 5:07 PM:
---

Just dumping it here before I forget:
* mesos master auth had to be turned off (need to figure out why that was necessary)
* with kerberos auth ON: local admin run required modifying the hosts file on a leader to alias localhost to the URL used in keytabs
* leader redirect has to be OFF (AURORA-1601)

Also: update thrift deprecation guidelines with dual backfill instructions to explicitly account for version rollback scenarios.

was (Author: maximk):
Just dumping it here before I forget:
* mesos master auth had to be turned off (need to figure out why that was necessary)
* with kerberos auth ON: local admin run required modifying the hosts file on a leader to alias localhost to the URL used in keytabs
* leader redirect has to be OFF (AURORA-1601)

> Update recovery docs to reflect changes
> ---
>
> Key: AURORA-1605
> URL: https://issues.apache.org/jira/browse/AURORA-1605
> Project: Aurora
> Issue Type: Task
> Components: Documentation
> Reporter: Joshua Cohen
> Priority: Minor
>
> We had to restore one of our clusters from backup recently, and it turns out
> there's been some drift between the [documented
> process](https://github.com/apache/aurora/blob/f630bf705ac8a9de2b7b987858ada3b876f65abf/docs/storage-config.md#recovering-from-a-scheduler-backup)
> and what's currently necessary.
> Specifically, we needed to disable the leader redirect filter and, I believe,
> mesos authentication.
> We should make sure the recovery docs are up to date with what's actually
> required.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
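The hosts-file workaround mentioned in the comment above could look roughly like the sketch below. The helper name `add_loopback_alias` and the hostname `aurora.example.com` are hypothetical; the ticket does not say which hostname the keytabs use or how exactly the file was edited, so treat this as an illustration under those assumptions rather than the documented procedure.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aurora): point a keytab hostname at the
# loopback address so a local admin run on the leader reaches the scheduler
# under the name the Kerberos principal expects.
add_loopback_alias() {
  hosts_file="$1"    # e.g. /etc/hosts
  keytab_host="$2"   # assumed hostname used in the keytab principal

  # Idempotent: do nothing if the exact entry is already present.
  if grep -qxF "127.0.0.1 ${keytab_host}" "$hosts_file"; then
    return 0
  fi
  printf '127.0.0.1 %s\n' "$keytab_host" >> "$hosts_file"
}

# Example (requires root, hostname is an assumption):
#   add_loopback_alias /etc/hosts aurora.example.com
```

The function takes the hosts file as a parameter so it can be tried against a scratch copy before touching `/etc/hosts`.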
[jira] [Comment Edited] (AURORA-1605) Update recovery docs to reflect changes
[ https://issues.apache.org/jira/browse/AURORA-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130751#comment-15130751 ]

John Sirois edited comment on AURORA-1605 at 2/3/16 6:07 PM:
-

I went through the docs using test_kerberos_end_to_end.sh and hit a few roadblocks / things that do not jibe with the description in this ticket. I'm sure I'm missing obvious things, but if not, my experience is detailed below.

h5. Setup environment to test recovery
# I edited test_kerberos_end_to_end.sh to skip tear-down and then ran it to set up the kerberized scheduler
# I ssh'd to vagrant and ran the steps through setup manually to get kinit'd as root for the aurora_admin commands I'd need to run, i.e. roughly:
{noformat}
cd ~/krb5-1.13.1/build
make testrealm
SCHEDULER_HOSTNAME=aurora.local
kadmin.local -q "addprinc -randkey HTTP/$SCHEDULER_HOSTNAME"
rm -f testdir/HTTP-$SCHEDULER_HOSTNAME.keytab.keytab
kadmin.local -q "ktadd -keytab testdir/HTTP-$SCHEDULER_HOSTNAME.keytab HTTP/$SCHEDULER_HOSTNAME"
kadmin.local -q "addprinc -randkey root"
rm -f testdir/root.keytab
kadmin.local -q "ktadd -keytab testdir/root.keytab root"
kinit -k -t "testdir/root.keytab" root
{noformat}
# {{aurora_admin scheduler_backup_now devcluster && aurora_admin scheduler_list_backups devcluster}}

h5. Do a restore
I ran through the restore docs as detailed below:

h6. Preparation
{noformat}
$ diff /etc/init/aurora-scheduler-kerberos.conf /etc/init/aurora-scheduler-kerberos.pre-recovery.conf
42,44c42
< -mesos_master_address=zk://localhost:181/mesos/master \
< -max_registration_delay=365days \
< -reconciliation_initial_delay=365days \
---
> -mesos_master_address=zk://localhost:2181/mesos/master \
{noformat}

h6. Restore from backup
The leading scheduler could only be identified via logs:
{noformat}
sudo grep "Elected as leading scheduler" /var/log/upstart/aurora-scheduler-kerberos.log | tail -1
I0203 16:57:05.336 [main, SchedulerLifecycle$5:238] Elected as leading scheduler!
{noformat}
or by examining zk nodes:
{noformat}
/usr/share/zookeeper/bin/zkCli.sh ls /aurora/scheduler
...
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[singleton_candidate_27]

/usr/share/zookeeper/bin/zkCli.sh get /aurora/scheduler/singleton_candidate_27
...
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
127.0.1.1
cZxid = 0x17b
ctime = Wed Feb 03 17:12:43 UTC 2016
mZxid = 0x17b
mtime = Wed Feb 03 17:12:43 UTC 2016
pZxid = 0x17b
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x152a7d138160090
dataLength = 9
numChildren = 0
{noformat}
All aurora_admin commands (e.g. {{aurora_admin get_scheduler}}, {{aurora_admin scheduler_list_backups}}) fail at this point, though, with output like this:
{noformat}
aurora_admin scheduler_stage_recovery -v --bypass-leader-redirect devcluster scheduler-backup-2016-02-03-16-32
DEBUG] Using auth module:
 INFO] Connecting to 192.168.33.7:2181
 INFO] Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=0, time_out=1, session_id=0, passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', read_only=None)
 INFO] Zookeeper connection established, state: CONNECTED
 INFO] Sending request(xid=1): GetChildren(path=u'/aurora/scheduler', watcher=)
 INFO] Received response(xid=1): [u'singleton_candidate_22']
 INFO] Sending request(xid=2): GetChildren(path=u'/aurora/scheduler', watcher=None)
 INFO] Received response(xid=2): [u'singleton_candidate_22']
 WARN] Could not connect to scheduler: No schedulers detected in devcluster!
{noformat}

As a result, the only way to complete the rest of the guide was to re-edit {{/etc/init/aurora-scheduler-kerberos.conf}} and restore the correct {{-mesos_master_address}}. After doing this and bouncing the scheduler I could run aurora_admin commands and successfully complete the restore via the rest of the guide.

So it seems to me the guide needs to suggest, at a high level:
# All schedulers are stopped (say 5 of them).
# All but one scheduler (4 in this example) are prepared as in "Preparation", but 1 scheduler is prepared as in "Preparation" except for the bit about setting an invalid {{-mesos_master_address}}, and with added emphasis on the bit about port-blocking to prevent user activity. This special scheduler will be used to run the recovery staging, review and commit.

If I have this approximately right, I concur with [~StephanErb]'s second comment above: the first "Identify the leading scheduler by" option ({{aurora_admin get_scheduler}}) will then always work, but it's beside the point since the preparation already singled out a leader to run the recovery against. This leads me to think the purpose of the "Identify the leading scheduler by" section is
[jira] [Comment Edited] (AURORA-1605) Update recovery docs to reflect changes
[ https://issues.apache.org/jira/browse/AURORA-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128763#comment-15128763 ]

Stephan Erb edited comment on AURORA-1605 at 2/2/16 6:45 PM:
-

The docs regarding an invalid {{mesos_master_address}} (https://github.com/apache/aurora/blob/f630bf705ac8a9de2b7b987858ada3b876f65abf/docs/storage-config.md#preparation) should have a slightly more explicit example: the current example {{-mesos_master_address=zk://localhost:2181}} might actually point at a valid ZK that is merely missing the {{/mesos}} node. Changing it to {{-mesos_master_address=zk://localhost:9}} will probably make the intent more obvious to readers.

was (Author: stephanerb):
The docs regarding invalid `mesos_master_address` (https://github.com/apache/aurora/blob/f630bf705ac8a9de2b7b987858ada3b876f65abf/docs/storage-config.md#preparation) should have slightly more explicit example: The current example `-mesos_master_address=zk://localhost:2181` might actually be a valid ZK but with a missing `/mesos` node. Changing it to `-mesos_master_address=zk://localhost:9` will probably be more obvious for readers.

> Update recovery docs to reflect changes
> ---
>
> Key: AURORA-1605
> URL: https://issues.apache.org/jira/browse/AURORA-1605
> Project: Aurora
> Issue Type: Task
> Components: Documentation
> Reporter: Joshua Cohen
> Priority: Minor
>
> We had to restore one of our clusters from backup recently, and it turns out
> there's been some drift between the [documented
> process](https://github.com/apache/aurora/blob/f630bf705ac8a9de2b7b987858ada3b876f65abf/docs/storage-config.md#recovering-from-a-scheduler-backup)
> and what's currently necessary.
> Specifically, we needed to disable the leader redirect filter and, I believe,
> mesos authentication.
> We should make sure the recovery docs are up to date with what's actually
> required.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
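As a rough sketch of the suggestion above, the flag could be swapped for the more obviously invalid address with a small helper like the one below. The function name `set_invalid_master` and the `.pre-recovery` backup suffix are hypothetical, and the sketch only rewrites {{-mesos_master_address}}; it makes no attempt to add the delay flags that the preparation step also calls for.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aurora): replace whatever value
# -mesos_master_address currently has with zk://localhost:9, keeping a
# copy of the original file alongside it.
set_invalid_master() {
  conf="$1"  # path to the scheduler's init/config file

  # Keep the original so the real address can be restored afterwards.
  cp "$conf" "$conf.pre-recovery"

  # Rewrite only the flag value; anything after it on the line (such as a
  # trailing line-continuation backslash) is left untouched.
  sed 's|-mesos_master_address=[^[:space:]]*|-mesos_master_address=zk://localhost:9|' \
    "$conf.pre-recovery" > "$conf"
}

# Example (path is an assumption for illustration):
#   set_invalid_master /etc/init/aurora-scheduler.conf
```

Restoring the real address after recovery is then a matter of copying the `.pre-recovery` file back over the edited one.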