> On Sept. 13, 2016, 5:11 p.m., Zameer Manji wrote:
> > I support this change as a developer.
> > 
> > As an operator I am scared.
> > 
> > What happens to an existing cluster if we don't set `framework_name`? Will 
> > it register another framework_id (bad), or will it fail to register 
> > (better)?
> 
> Santhosh Kumar Shanmugham wrote:
>     The restarting framework will be treated like a scheduler fail-over.
> 
> Zameer Manji wrote:
>     The release notes in this patch say:
>     > Update default value of command line option `-framework_name` to 'aurora'.
>     > Please be aware that depending on your usage of Mesos, this will be a
>     > backward incompatible change.
>       
>     I'm trying to understand the implications of the backwards 
> incompatibility. Will the scheduler fail to register, or will it register 
> under a new framework ID (and then lose track of previous tasks)?
> 
> Joshua Cohen wrote:
>     Santhosh, did you verify this in vagrant with a scheduler that already 
> had tasks running? If it is backwards compatible then we can probably adjust 
> the release notes?

Results from testing in a Vagrant cluster:

Renaming framework from 'TwitterScheduler' to 'Aurora':

The framework re-registers after the restart (the master treats this as a 
failover), keeps the same framework ID, and performs task reconciliation, 
which restores the tasks.

I0914 16:48:28.408182  9815 master.cpp:1297] Giving framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (TwitterScheduler) at 
scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 3weeks to 
failover
I0914 16:48:28.408226  9815 hierarchical.cpp:382] Deactivated framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000
E0914 16:48:28.408617  9819 process.cpp:2105] Failed to shutdown socket with fd 
28: Transport endpoint is not connected
I0914 16:48:43.722126  9813 master.cpp:2424] Received SUBSCRIBE call for 
framework 'Aurora' at 
scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
I0914 16:48:43.722190  9813 master.cpp:2500] Subscribing framework Aurora with 
checkpointing enabled and capabilities [ REVOCABLE_RESOURCES, GPU_RESOURCES ]
I0914 16:48:43.722225  9813 master.cpp:2564] Updating info for framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000
I0914 16:48:43.722256  9813 master.cpp:2577] Framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at 
scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 failed over
I0914 16:48:43.722429  9813 hierarchical.cpp:348] Activated framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000
I0914 16:48:43.722595  9813 master.cpp:5709] Sending 1 offers to framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at 
scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
I0914 16:49:44.204677  9812 master.cpp:5447] Performing explicit task state 
reconciliation for 1 tasks of framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at 
scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083

Rolling back framework name to 'TwitterScheduler' from 'Aurora':

Same result: the framework re-registers under the same framework ID and 
reconciles its tasks.

I0914 16:51:33.203495  9812 master.cpp:1297] Giving framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at 
scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083 3weeks to 
failover
I0914 16:51:33.203526  9812 hierarchical.cpp:382] Deactivated framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000
I0914 16:51:49.614074  9813 master.cpp:2424] Received SUBSCRIBE call for 
framework 'TwitterScheduler' at 
scheduler-6fa8b819-aed9-42e1-9c6c-3e4be2f62500@192.168.33.7:8083
I0914 16:51:49.614215  9813 master.cpp:2500] Subscribing framework 
TwitterScheduler with checkpointing enabled and capabilities [ 
REVOCABLE_RESOURCES, GPU_RESOURCES ]
I0914 16:51:49.614312  9813 master.cpp:2564] Updating info for framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000
I0914 16:51:49.614359  9813 master.cpp:2577] Framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (TwitterScheduler) at 
scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083 failed over
I0914 16:51:49.614977  9813 hierarchical.cpp:348] Activated framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000
I0914 16:51:49.615170  9813 master.cpp:5709] Sending 1 offers to framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (TwitterScheduler) at 
scheduler-6fa8b819-aed9-42e1-9c6c-3e4be2f62500@192.168.33.7:8083
I0914 16:52:50.315119  9812 master.cpp:5447] Performing explicit task state 
reconciliation for 1 tasks of framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (TwitterScheduler) at 
scheduler-6fa8b819-aed9-42e1-9c6c-3e4be2f62500@192.168.33.7:8083

Restarting the scheduler after updating the config to 'TwitterScheduler' from 
'Aurora':

The rename did not take effect. The master re-registered the framework under 
the same ID and performed task reconciliation.

I0914 20:11:49.178103 28171 master.cpp:1297] Giving framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at 
scheduler-c42cd8cf-09a0-4d81-a947-094c4fac601e@192.168.33.7:8083 3weeks to 
failover
I0914 20:11:49.178138 28171 hierarchical.cpp:382] Deactivated framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000
E0914 20:11:49.183275 28178 process.cpp:2105] Failed to shutdown socket with fd 
29: Transport endpoint is not connected
I0914 20:12:33.277560 28177 master.cpp:2424] Received SUBSCRIBE call for 
framework 'Aurora' at 
scheduler-6dcb9baa-503f-44a9-9df6-79da717f3a1c@192.168.33.7:8083
I0914 20:12:33.277710 28177 master.cpp:2500] Subscribing framework Aurora with 
checkpointing enabled and capabilities [ REVOCABLE_RESOURCES, GPU_RESOURCES ]
I0914 20:12:33.277753 28177 master.cpp:2564] Updating info for framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000
I0914 20:12:33.277784 28177 master.cpp:2577] Framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at 
scheduler-c42cd8cf-09a0-4d81-a947-094c4fac601e@192.168.33.7:8083 failed over
I0914 20:12:33.277961 28177 hierarchical.cpp:348] Activated framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000
I0914 20:12:33.278136 28177 master.cpp:5709] Sending 1 offers to framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at 
scheduler-6dcb9baa-503f-44a9-9df6-79da717f3a1c@192.168.33.7:8083
I0914 20:13:33.848175 28175 master.cpp:5447] Performing explicit task state 
reconciliation for 1 tasks of framework 
071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at 
scheduler-6dcb9baa-503f-44a9-9df6-79da717f3a1c@192.168.33.7:8083

In all of the above cases, the running task was not affected and was available 
in the UI after the scheduler restarted.
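
For anyone who wants to repeat this check against their own master log, here is 
a minimal sketch in Python. It is not part of the patch. The function name 
`failover_events` and the regular expression are my own assumptions, based only 
on the "failed over" lines quoted above; other Mesos versions may log this 
differently.

```python
import re

# Assumed pattern, derived from the "Framework <id> (<name>) at <pid> failed over"
# lines quoted in this e-mail. Not an official Mesos log format definition.
FAILED_OVER = re.compile(r"Framework (\S+) \((\w+)\) at \S+ failed over")


def failover_events(log_text):
    """Return a list of (framework_id, framework_name) per 'failed over' entry."""
    # The e-mail wrapped long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return FAILED_OVER.findall(flat)


# Two entries copied from the logs above (rename, then rollback).
sample = """
I0914 16:48:43.722256  9813 master.cpp:2577] Framework
071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at
scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 failed over
I0914 16:51:49.614359  9813 master.cpp:2577] Framework
071c44a1-b4d4-4339-a727-03a79f725851-0000 (TwitterScheduler) at
scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083 failed over
"""

events = failover_events(sample)
ids = {framework_id for framework_id, _ in events}
names = [name for _, name in events]

# A single distinct ID across different names means the rename was handled
# as a failover rather than as a new framework registration.
print("distinct framework ids:", len(ids))
print("names seen:", names)
```

If the set of IDs ever contains more than one entry, the master registered a 
new framework instead of failing over, which is the case Zameer was worried 
about.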


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51874/#review148816
-----------------------------------------------------------


On Sept. 13, 2016, 5:18 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51874/
> -----------------------------------------------------------
> 
> (Updated Sept. 13, 2016, 5:18 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1688
>     https://issues.apache.org/jira/browse/AURORA-1688
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Change framework_name default value from 'TwitterScheduler' to 'aurora'
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 8a386bd208956eb0c8c2f48874b0c6fb3af58872 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 97677f24a50963178a123b420d7ac136e4fde3fe 
> 
> Diff: https://reviews.apache.org/r/51874/diff/
> 
> 
> Testing
> -------
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>
