http://git-wip-us.apache.org/repos/asf/aurora/blob/f28f41a7/docs/reference/configuration.md
----------------------------------------------------------------------
diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md
new file mode 100644
index 0000000..fa09fd4
--- /dev/null
+++ b/docs/reference/configuration.md
@@ -0,0 +1,573 @@
+Aurora Configuration Reference
+==============================
+
+Don't know where to start? The Aurora configuration schema is very
+powerful, and configurations can become quite complex for advanced use
+cases.
+
+For examples of simple configurations to get something up and running
+quickly, check out the [Tutorial](../getting-started/tutorial.md). When you 
feel comfortable with the basics, move
+on to the [Configuration Tutorial](configuration-tutorial.md) for more 
in-depth coverage of
+configuration design.
+
+- [Process Schema](#process-schema)
+    - [Process Objects](#process-objects)
+- [Task Schema](#task-schema)
+    - [Task Object](#task-object)
+    - [Constraint Object](#constraint-object)
+    - [Resource Object](#resource-object)
+- [Job Schema](#job-schema)
+    - [Job Objects](#job-objects)
+    - [UpdateConfig Objects](#updateconfig-objects)
+    - [HealthCheckConfig Objects](#healthcheckconfig-objects)
+    - [Announcer Objects](#announcer-objects)
+    - [Container Objects](#container-objects)
+    - [LifecycleConfig Objects](#lifecycleconfig-objects)
+- [Specifying Scheduling Constraints](#specifying-scheduling-constraints)
+- [Template Namespaces](#template-namespaces)
+    - [mesos Namespace](#mesos-namespace)
+    - [thermos Namespace](#thermos-namespace)
+
+
+Process Schema
+==============
+
+Process objects consist of required `name` and `cmdline` attributes. You can 
customize Process
+behavior with its optional attributes. Remember, Processes are handled by 
Thermos.
+
+### Process Objects
+
+  **Attribute Name**  | **Type**    | **Description**
+  ------------------- | :---------: | ---------------------------------
+   **name**           | String      | Process name (Required)
+   **cmdline**        | String      | Command line (Required)
+   **max_failures**   | Integer     | Maximum process failures (Default: 1)
+   **daemon**         | Boolean     | When True, this is a daemon process. 
(Default: False)
+   **ephemeral**      | Boolean     | When True, this is an ephemeral process. 
(Default: False)
+   **min_duration**   | Integer     | Minimum duration between process 
restarts in seconds. (Default: 15)
+   **final**          | Boolean     | When True, this process is a finalizing 
one that should run last. (Default: False)
+   **logger**         | Logger      | Struct defining the log behavior for the 
process. (Default: Empty)
+
+#### name
+
+The name is any valid UNIX filename string (specifically no
+slashes, NULLs or leading periods). Within a Task object, each Process name
+must be unique.
+
+#### cmdline
+
+The command line run by the process. The command line is invoked in a bash
+subshell, so can involve fully-blown bash scripts. However, nothing is
+supplied for command-line arguments so `$*` is unspecified.
+
+#### max_failures
+
+The maximum number of failures (non-zero exit statuses) this process can
+have before being marked permanently failed and not retried. If a
+process permanently fails, Thermos looks at the failure limit of the task
+containing the process (usually 1) to determine if the task has
+failed as well.
+
+Setting `max_failures` to 0 makes the process retry
+indefinitely until it achieves a successful (zero) exit status.
+It retries at most once every `min_duration` seconds to prevent
+an effective denial of service attack on the coordinating Thermos scheduler.
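+
+For instance, a Process that should retry indefinitely until it succeeds, but
+no more than once per minute, could be sketched as follows (the name and
+command line are illustrative):
+
+        poller = Process(
+          name = 'poller',
+          cmdline = 'curl -f http://localhost:8080/ready',
+          max_failures = 0,     # retry until a zero exit status is achieved
+          min_duration = 60)    # wait at least 60 seconds between runs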
+
+#### daemon
+
+By default, Thermos processes are non-daemon. If `daemon` is set to True, a
+successful (zero) exit status does not prevent future process runs.
+Instead, the process reinvokes after `min_duration` seconds.
+However, the maximum failure limit still applies. A combination of
+`daemon=True` and `max_failures=0` causes a process to retry
+indefinitely regardless of exit status. This should be avoided
+for very short-lived processes because of the accumulation of
+checkpointed state for each process run. When running in Mesos
+specifically, `max_failures` is capped at 100.
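+
+As a sketch, a monitor that should be restarted after every run, successful or
+not, combines the two attributes described above (the command is illustrative):
+
+        monitor = Process(
+          name = 'monitor',
+          cmdline = './monitor.sh',
+          daemon = True,        # re-run even after successful (zero) exits
+          max_failures = 0)     # never mark the process permanently failed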
+
+#### ephemeral
+
+By default, Thermos processes are non-ephemeral. If `ephemeral` is set to
+True, the process' status is not used to determine if its containing task
+has completed. For example, consider a task with a non-ephemeral
+webserver process and an ephemeral logsaver process
+that periodically checkpoints its log files to a centralized data store.
+The task is considered finished once the webserver process has
+completed, regardless of the logsaver's current status.
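+
+The webserver/logsaver scenario above could be expressed as follows (the
+command lines are illustrative):
+
+        webserver = Process(name = 'webserver', cmdline = './run_server.sh')
+        logsaver = Process(
+          name = 'logsaver',
+          cmdline = './checkpoint_logs.sh',
+          ephemeral = True)   # status is ignored when deciding task completion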
+
+#### min_duration
+
+Processes may succeed or fail multiple times during a single task's
+duration. Each of these is called a *process run*. `min_duration` is
+the minimum number of seconds the scheduler waits before running the
+same process.
+
+#### final
+
+Processes can be grouped into two classes: ordinary processes and
+finalizing processes. By default, Thermos processes are ordinary. They
+run as long as the task is considered healthy (i.e., no failure
+limits have been reached.) But once all regular Thermos processes
+finish or the task reaches a certain failure threshold, it
+moves into a "finalization" stage and runs all finalizing
+processes. These are typically processes necessary for cleaning up the
+task, such as log checkpointers, or perhaps e-mail notifications that
+the task completed.
+
+Finalizing processes may not depend upon ordinary processes or
+vice-versa, however finalizing processes may depend upon other
+finalizing processes and otherwise run as a typical process
+schedule.
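+
+A cleanup step that should run during finalization can be sketched as (the
+command is illustrative):
+
+        cleanup = Process(
+          name = 'cleanup',
+          cmdline = 'rm -rf /tmp/scratch',
+          final = True)   # runs only once the task enters finalization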
+
+#### logger
+
+The default behavior of Thermos is to store stderr/stdout logs in files which
+grow unbounded. If you have a large log volume, you may want to configure
+Thermos to rotate logs automatically once they grow to a certain size, which
+can prevent your job from using more than its allocated disk space.
+
+A Logger union consists of a destination enum, a mode enum and a rotation 
policy.
+Use `destination` to set where the process logs should be sent. The default
+option is `file`. It is also possible to specify `console` to send log output
+to stdout/stderr, `none` to suppress log output entirely, or `both` to send
+logs to files as well as console output. When `none` or `console` is used, the
+rotation attributes are ignored.
+Rotation policies only apply to loggers whose mode is `rotate`. The acceptable 
values
+for the LoggerMode enum are `standard` and `rotate`. The rotation policy 
applies to both
+stderr and stdout.
+
+By default, all processes use the `standard` LoggerMode.
+
+  **Attribute Name**  | **Type**          | **Description**
+  ------------------- | :---------------: | ---------------------------------
+   **destination**    | LoggerDestination | Destination of logs. (Default: 
`file`)
+   **mode**           | LoggerMode        | Mode of the logger. (Default: 
`standard`)
+   **rotate**         | RotatePolicy      | An optional rotation policy.
+
+A RotatePolicy describes log rotation behavior for when `mode` is set to 
`rotate`. It is ignored
+otherwise.
+
+  **Attribute Name**  | **Type**     | **Description**
+  ------------------- | :----------: | ---------------------------------
+   **log_size**       | Integer      | Maximum size (in bytes) of an 
individual log file. (Default: 100 MiB)
+   **backups**        | Integer      | The maximum number of backups to 
retain. (Default: 5)
+
+An example process configuration is as follows:
+
+        process = Process(
+          name='process',
+          logger=Logger(
+            destination=LoggerDestination('both'),
+            mode=LoggerMode('rotate'),
+            rotate=RotatePolicy(log_size=5*MB, backups=5)
+          )
+        )
+
+Task Schema
+===========
+
+Tasks fundamentally consist of a `name` and a list of Process objects stored 
as the
+value of the `processes` attribute. Processes can be further constrained with
+`constraints`. By default, `name`'s value inherits from the first Process in 
the
+`processes` list, so for simple `Task` objects with one Process, `name`
+can be omitted. In Mesos, `resources` is also required.
+
+### Task Object
+
+   **param**               | **type**                         | **description**
+   ---------               | :---------:                      | ---------------
+   ```name```              | String                           | Task name (Required) (Default: ```processes0.name```)
+   ```processes```         | List of ```Process``` objects    | List of 
```Process``` objects bound to this task. (Required)
+   ```constraints```       | List of ```Constraint``` objects | List of 
```Constraint``` objects constraining processes.
+   ```resources```         | ```Resource``` object            | Resource 
footprint. (Required)
+   ```max_failures```      | Integer                          | Maximum 
process failures before being considered failed (Default: 1)
+   ```max_concurrency```   | Integer                          | Maximum number 
of concurrent processes (Default: 0, unlimited concurrency.)
+   ```finalization_wait``` | Integer                          | Amount of time 
allocated for finalizing processes, in seconds. (Default: 30)
+
+#### name
+`name` is a string denoting the name of this task. It defaults to the name of 
the first Process in
+the list of Processes associated with the `processes` attribute.
+
+#### processes
+
+`processes` is an unordered list of `Process` objects. To constrain the order
+in which they run, use `constraints`.
+
+##### constraints
+
+A list of `Constraint` objects. Currently it supports only one type,
+the `order` constraint. `order` is a list of process names
+that should run in the order given. For example,
+
+        process = Process(cmdline = "echo hello {{name}}")
+        task = Task(name = "echoes",
+                    processes = [process(name = "jim"), process(name = "bob")],
+                    constraints = [Constraint(order = ["jim", "bob"])])
+
+Constraints can be supplied ad-hoc and in duplicate. Not all
+Processes need be constrained, however Tasks with cycles are
+rejected by the Thermos scheduler.
+
+Use the `order` function as shorthand to generate `Constraint` lists.
+The following:
+
+        order(process1, process2)
+
+is shorthand for
+
+        [Constraint(order = [process1.name(), process2.name()])]
+
+The `order` function accepts Process name strings `('foo', 'bar')` or the 
processes
+themselves, e.g. `foo=Process(name='foo', ...)`, `bar=Process(name='bar', 
...)`,
+`constraints=order(foo, bar)`.
+
+#### resources
+
+Takes a `Resource` object, which specifies the amounts of CPU, memory, and 
disk space resources
+to allocate to the Task.
+
+#### max_failures
+
+`max_failures` is the number of failed processes needed for the `Task` to be
+marked as failed.
+
+For example, assume a Task has two Processes and a `max_failures` value of `2`:
+
+        template = Process(max_failures=10)
+        task = Task(
+          name = "fail",
+          processes = [
+             template(name = "failing", cmdline = "exit 1"),
+             template(name = "succeeding", cmdline = "exit 0")
+          ],
+          max_failures=2)
+
+The `failing` Process could fail 10 times before being marked as permanently
+failed, while the `succeeding` Process would succeed on its first run. The
+Task would nonetheless succeed: there would be 10 failed process *runs* but
+only 1 permanently failed process, and with `max_failures=2` both processes
+would have to fail for the Task to fail.
+
+#### max_concurrency
+
+For Tasks with a number of expensive but otherwise independent
+processes, you may want to limit the amount of concurrency
+the Thermos scheduler provides rather than artificially constraining
+it via `order` constraints. For example, a test framework may
+generate a task with 100 test run processes, but wants to run it on
+a machine with only 4 cores. You can limit the amount of parallelism to
+4 by setting `max_concurrency=4` in your task configuration.
+
+For example, the following task spawns 180 Processes ("mappers")
+to compute individual elements of a 180 degree sine table, all dependent
+upon one final Process ("reducer") to tabulate the results:
+
+    def make_mapper(id):
+      return Process(
+        name = "mapper%03d" % id,
+        cmdline = "echo 'scale=50;s(%d*4*a(1)/180)' | bc -l > temp.sine_table.%03d" % (id, id))
+
+    def make_reducer():
+      return Process(name = "reducer",
+                     cmdline = "cat temp.* | nl > sine_table.txt && rm -f temp.*")
+
+    processes = map(make_mapper, range(180))
+
+    task = Task(
+      name = "mapreduce",
+      processes = processes + [make_reducer()],
+      constraints = [Constraint(order = [mapper.name(), 'reducer']) for mapper
+                     in processes],
+      max_concurrency = 8)
+
+#### finalization_wait
+
+Process execution is organized into three active stages: `ACTIVE`,
+`CLEANING`, and `FINALIZING`. The `ACTIVE` stage is when ordinary processes run.
+This stage lasts as long as Processes are running and the Task is healthy.
+The moment either all Processes have finished successfully or the Task has
+reached a maximum Process failure limit, it goes into the `CLEANING` stage and
+sends SIGTERMs to all currently running Processes and their process trees.
+Once all Processes have terminated, the Task goes into the `FINALIZING` stage
+and invokes the schedule of all Processes with the "final" attribute set to True.
+
+This whole process from the end of `ACTIVE` stage to the end of `FINALIZING`
+must happen within `finalization_wait` seconds. If it does not
+finish during that time, all remaining Processes are sent SIGKILLs
+(or if they depend upon uncompleted Processes, are
+never invoked.)
+
+When running on Aurora, the `finalization_wait` is capped at 60 seconds.
+
+### Constraint Object
+
+Current Constraint objects support only a single ordering constraint, `order`,
+which specifies that its processes run sequentially in the order given. By
+default, all processes run in parallel when bound to a `Task` without
+ordering constraints.
+
+   param | type           | description
+   ----- | :----:         | -----------
+   order | List of String | List of processes by name (String) that should be 
run serially.
+
+### Resource Object
+
+Specifies the amount of CPU, RAM, and disk resources the task needs. See the
+[Resource Isolation document](../features/resource-isolation.md) for suggested 
values and to understand how
+resources are allocated.
+
+  param      | type    | description
+  -----      | :----:  | -----------
+  ```cpu```  | Float   | Fractional number of cores required by the task.
+  ```ram```  | Integer | Bytes of RAM required by the task.
+  ```disk``` | Integer | Bytes of disk required by the task.
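+
+For example, a task needing two cores, 4 GiB of RAM and 8 GiB of disk could
+declare (assuming the `Resources` constructor and size aliases such as `GB`,
+as used in the Tutorial):
+
+        resources = Resources(cpu = 2.0, ram = 4*GB, disk = 8*GB)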
+
+
+Job Schema
+==========
+
+### Job Objects
+
+   name | type | description
+   ------ | :-------: | -------
+  ```task``` | Task | The Task object to bind to this job. Required.
+  ```name``` | String | Job name. (Default: inherited from the task 
attribute's name)
+  ```role``` | String | Job role account. Required.
+  ```cluster``` | String | Cluster in which this job is scheduled. Required.
+   ```environment``` | String | Job environment, default ```devel```. Must be 
one of ```prod```, ```devel```, ```test``` or ```staging<number>```.
+  ```contact``` | String | Best email address to reach the owner of the job. 
For production jobs, this is usually a team mailing list.
+  ```instances```| Integer | Number of instances (sometimes referred to as 
replicas or shards) of the task to create. (Default: 1)
+   ```cron_schedule``` | String | Cron schedule in cron format. May only be 
used with non-service jobs. See [Cron Jobs](cron-jobs.md) for more information. 
Default: None (not a cron job.)
+  ```cron_collision_policy``` | String | Policy to use when a cron job is triggered while a previous run is still active. KILL_EXISTING: kill the previous run and schedule the new run. CANCEL_NEW: let the previous run continue and cancel the new run. (Default: KILL_EXISTING)
+  ```update_config``` | ```UpdateConfig``` object | Parameters for controlling 
the rate and policy of rolling updates.
+  ```constraints``` | dict | Scheduling constraints for the tasks. See the section on the [constraint specification language](#specifying-scheduling-constraints)
+  ```service``` | Boolean | If True, restart tasks regardless of success or 
failure. (Default: False)
+  ```max_task_failures``` | Integer | Maximum number of failures after which 
the task is considered to have failed (Default: 1) Set to -1 to allow for 
infinite failures
+  ```priority``` | Integer | Preemption priority to give the task (Default 0). 
Tasks with higher priorities may preempt tasks at lower priorities.
+  ```production``` | Boolean |  Whether or not this is a production task that 
may [preempt](resources.md#task-preemption) other tasks (Default: False). 
Production job role must have the appropriate 
[quota](resources.md#resource-quota).
+  ```health_check_config``` | ```HealthCheckConfig``` object | Parameters for controlling a task's health checks. HTTP health check is only used if a health port was assigned with a command line wildcard.
+  ```container``` | ```Container``` object | An optional container to run all 
processes inside of.
+  ```lifecycle``` | ```LifecycleConfig``` object | An optional task lifecycle 
configuration that dictates commands to be executed on startup/teardown.  HTTP 
lifecycle is enabled by default if the "health" port is requested.  See 
[LifecycleConfig Objects](#lifecycleconfig-objects) for more information.
+  ```tier``` | String | Task tier type. When set to `revocable`, requires the task to run with Mesos revocable resources. This is work [in progress](https://issues.apache.org/jira/browse/AURORA-1343) and is currently only supported for revocable tasks. The ultimate goal is to simplify task configuration by hiding various configuration knobs behind a task tier definition. See AURORA-1343 and AURORA-1443 for more details.
+
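+A minimal sketch tying several of these attributes together (the cluster and
+role names are illustrative, and `hello_task` stands for a Task defined as in
+the Task Schema above):
+
+        hello_job = Job(
+          cluster = 'devcluster',   # illustrative cluster name
+          environment = 'devel',
+          role = 'www-data',        # illustrative role account
+          name = 'hello',
+          instances = 2,
+          service = True,           # restart tasks regardless of exit status
+          task = hello_task)
+        jobs = [hello_job]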
+
+### UpdateConfig Objects
+
+Parameters for controlling the rate and policy of rolling updates.
+
+| object                       | type     | description
+| ---------------------------- | :------: | ------------
+| ```batch_size```             | Integer  | Maximum number of shards to be 
updated in one iteration (Default: 1)
+| ```watch_secs```             | Integer  | Minimum number of seconds a shard 
must remain in ```RUNNING``` state before considered a success (Default: 45)
+| ```max_per_shard_failures``` | Integer  | Maximum number of restarts per 
shard during update. Increments total failure count when this limit is 
exceeded. (Default: 0)
+| ```max_total_failures```     | Integer  | Maximum number of shard failures 
to be tolerated in total during an update. Cannot be greater than or equal to 
the total number of tasks in a job. (Default: 0)
+| ```rollback_on_failure```    | boolean  | When False, prevents auto rollback 
of a failed update (Default: True)
+| ```wait_for_batch_completion```| boolean | When True, all threads from a 
given batch will be blocked from picking up new instances until the entire 
batch is updated. This essentially simulates the legacy sequential updater 
algorithm. (Default: False)
+| ```pulse_interval_secs```    | Integer  |  Indicates a [coordinated 
update](client-commands.md#user-content-coordinated-job-updates). If no pulses 
are received within the provided interval the update will be blocked. 
Beta-updater only. Will fail on submission when used with client updater. 
(Default: None)
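+
+As an illustrative sketch, an update that rolls out five shards at a time and
+tolerates up to two restarts per shard might look like:
+
+        update_config = UpdateConfig(
+          batch_size = 5,
+          watch_secs = 45,              # shard must stay RUNNING this long
+          max_per_shard_failures = 2,
+          max_total_failures = 0)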
+
+### HealthCheckConfig Objects
+
+*Note: ```endpoint```, ```expected_response``` and
+```expected_response_code``` are deprecated from ```HealthCheckConfig``` and
+must be defined in ```HttpHealthChecker```.*
+
+Parameters for controlling a task's health checks via HTTP or a shell command.
+
+| param                          | type      | description
+| -------                        | :-------: | --------
+| ```health_checker```           | HealthCheckerConfig | Configure what kind 
of health check to use.
+| ```initial_interval_secs```    | Integer   | Initial delay for performing a 
health check. (Default: 15)
+| ```interval_secs```            | Integer   | Interval on which to check the 
task's health. (Default: 10)
+| ```max_consecutive_failures``` | Integer   | Maximum number of consecutive 
failures that will be tolerated before considering a task unhealthy (Default: 0)
+| ```timeout_secs```             | Integer   | Health check timeout. (Default: 
1)
+
+### HealthCheckerConfig Objects
+| param                          | type                | description
+| -------                        | :-------:           | --------
+| ```http```                     | HttpHealthChecker  | Configure health check 
to use HTTP. (Default)
+| ```shell```                    | ShellHealthChecker | Configure health check 
via a shell command.
+
+### HttpHealthChecker Objects
+| param                          | type      | description
+| -------                        | :-------: | --------
+| ```endpoint```                 | String    | HTTP endpoint to check 
(Default: /health)
+| ```expected_response```        | String    | If not empty, fail the HTTP 
health check if the response differs. Case insensitive. (Default: ok)
+| ```expected_response_code```   | Integer   | If not zero, fail the HTTP 
health check if the response code differs. (Default: 0)
+
+### ShellHealthChecker Objects
+| param                          | type      | description
+| -------                        | :-------: | --------
+| ```shell_command```            | String    | An alternative to HTTP health 
checking. Specifies a shell command that will be executed. Any non-zero exit 
status will be interpreted as a health check failure.
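+
+The nesting of these objects can be sketched as follows (the endpoint and
+interval values are illustrative):
+
+        health_check_config = HealthCheckConfig(
+          health_checker = HealthCheckerConfig(
+            http = HttpHealthChecker(endpoint = '/health')),
+          initial_interval_secs = 30,
+          interval_secs = 15,
+          max_consecutive_failures = 3)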
+
+
+### Announcer Objects
+
+If the `announce` field in the Job configuration is set, each task will be
+registered in the ServerSet `/aurora/role/environment/jobname` in the
+zookeeper ensemble configured by the executor (which can optionally be
+overridden by specifying the `zk_path` parameter). If no Announcer object is
+specified, no announcement will take place. For more information about
+ServerSets, see the [Service Discovery](../features/service-discovery.md)
+documentation.
+
+By default, the hostname in the registered endpoints will be the `--hostname`
+parameter that is passed to the mesos slave. To override the hostname value,
+the executor can be started with `--announcer-hostname=<overridden_value>`. If
+you decide to use `--announcer-hostname` and the overridden value needs to
+change for every executor, then the executor has to be started inside a
+wrapper, see [Executor Wrapper](../operations/configuration.md#thermos-executor-wrapper).
+
+For example, if you want the hostname in the endpoint to be an IP address 
instead of the hostname,
+the `--hostname` parameter to the mesos slave can be set to the machine IP or 
the executor can
+be started with `--announcer-hostname=<host_ip>` while wrapping the executor 
inside a script.
+
+| object                         | type      | description
+| -------                        | :-------: | --------
+| ```primary_port```             | String    | Which named port to register as 
the primary endpoint in the ServerSet (Default: `http`)
+| ```portmap```                  | dict      | A mapping of additional 
endpoints to be announced in the ServerSet (Default: `{ 'aurora': 
'{{primary_port}}' }`)
+| ```zk_path```                  | String    | Zookeeper serverset path 
override (executor must be started with the 
`--announcer-allow-custom-serverset-path` parameter)
+
+#### Port aliasing with the Announcer `portmap`
+
+The primary endpoint registered in the ServerSet is the one allocated to the 
port
+specified by the `primary_port` in the `Announcer` object, by default
+the `http` port.  This port can be referenced from anywhere within a 
configuration
+as `{{thermos.ports[http]}}`.
+
+Without the port map, each named port would be allocated a unique port number.
+The `portmap` allows two different named ports to be aliased together.  The 
default
+`portmap` aliases the `aurora` port (i.e. `{{thermos.ports[aurora]}}`) to
+the `http` port.  Even though the two ports can be referenced independently,
+only one port is allocated by Mesos.  Any port referenced in a `Process` object
+but which is not in the portmap will be allocated dynamically by Mesos and 
announced as well.
+
+It is possible to use the portmap to alias names to static port numbers, e.g.
+`{'http': 80, 'https': 443, 'aurora': 'http'}`.  In this case, referencing
+`{{thermos.ports[aurora]}}` would look up `{{thermos.ports[http]}}` then
+find a static port 80.  No port would be requested of or allocated by Mesos.
+
+Static ports should be used cautiously as Aurora does nothing to prevent two
+tasks with the same static port allocations from being co-scheduled.
+External constraints such as slave attributes should be used to enforce such
+guarantees should they be needed.
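+
+Putting the attributes above together, an Announcer that keeps the default
+`aurora` alias and additionally pins a static `https` port might be sketched
+as:
+
+        announce = Announcer(
+          primary_port = 'http',
+          portmap = {'aurora': '{{primary_port}}', 'https': 443})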
+
+### Container Objects
+
+*Note: The only container type currently supported is "docker".  Docker 
support is currently EXPERIMENTAL.*
+*Note: In order to correctly execute processes inside a job, the Docker 
container must have python 2.7 installed.*
+
+*Note: For private docker registry, mesos mandates the docker credential file 
to be named as `.dockercfg`, even though docker may create a credential file 
with a different name on various platforms. Also, the `.dockercfg` file needs 
to be copied into the sandbox using the `-thermos_executor_resources` flag, 
specified while starting Aurora.*
+
+Describes the container the job's processes will run inside.
+
+  param          | type           | description
+  -----          | :----:         | -----------
+  ```docker```   | Docker         | A docker container to use.
+
+### Docker Object
+
+  param            | type            | description
+  -----            | :----:          | -----------
+  ```image```      | String          | The name of the docker image to 
execute.  If the image does not exist locally it will be pulled with ```docker 
pull```.
+  ```parameters``` | List(Parameter) | Additional parameters to pass to the 
docker containerizer.
+
+### Docker Parameter Object
+
+Docker CLI parameters. This needs to be enabled via the scheduler's
+`allow_docker_parameters` option.
+See [Docker Command Line 
Reference](https://docs.docker.com/reference/commandline/run/) for valid 
parameters.
+
+  param            | type            | description
+  -----            | :----:          | -----------
+  ```name```       | String          | The name of the docker parameter. E.g. 
volume
+  ```value```      | String          | The value of the parameter. E.g. 
/usr/local/bin:/usr/bin:rw
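+
+Combining the objects above, a Docker-based container might be declared as
+follows (the image and volume paths are illustrative):
+
+        container = Container(
+          docker = Docker(
+            image = 'python:2.7',
+            parameters = [Parameter(name = 'volume',
+                                    value = '/etc/config:/etc/config:ro')]))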
+
+### LifecycleConfig Objects
+
+*Note: The only lifecycle configuration supported is the HTTP lifecycle via 
the HttpLifecycleConfig.*
+
+  param          | type                | description
+  -----          | :----:              | -----------
+  ```http```     | HttpLifecycleConfig | Configure the lifecycle manager to 
send lifecycle commands to the task via HTTP.
+
+### HttpLifecycleConfig Objects
+
+  param          | type            | description
+  -----          | :----:          | -----------
+  ```port```     | String          | The named port to send POST commands 
(Default: health)
+  ```graceful_shutdown_endpoint``` | String | Endpoint to hit to indicate that 
a task should gracefully shutdown. (Default: /quitquitquit)
+  ```shutdown_endpoint``` | String | Endpoint to hit to give a task its final 
warning before being killed. (Default: /abortabortabort)
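+
+For example, a lifecycle configuration spelling out the defaults explicitly
+could be written as:
+
+        lifecycle = LifecycleConfig(
+          http = HttpLifecycleConfig(
+            port = 'health',
+            graceful_shutdown_endpoint = '/quitquitquit',
+            shutdown_endpoint = '/abortabortabort'))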
+
+#### graceful_shutdown_endpoint
+
+If the Job is listening on the port as specified by the HttpLifecycleConfig
+(default: `health`), an HTTP POST request will be sent over localhost to this
+endpoint to request that the task gracefully shut itself down. This is a
+courtesy call before the `shutdown_endpoint` is invoked a fixed amount of
+time later.
+
+#### shutdown_endpoint
+
+If the Job is listening on the port as specified by the HttpLifecycleConfig
+(default: `health`), an HTTP POST request will be sent over localhost to this
+endpoint as a final warning before the task is shut down. If the task does
+not shut down on its own after this, it will be forcefully killed.
+
+
+Specifying Scheduling Constraints
+=================================
+
+In the `Job` object there is a map `constraints` from String to String
+allowing the user to tailor the schedulability of tasks within the job.
+
+The constraint map's key value is the attribute name in which we
+constrain Tasks within our Job. The value is how we constrain them.
+There are two types of constraints: *limit constraints* and *value
+constraints*.
+
+| constraint    | description
+| ------------- | --------------
+| Limit         | A string that specifies a limit for a constraint. Starts 
with <code>'limit:</code> followed by an Integer and closing single quote, such 
as ```'limit:1'```.
+| Value         | A string that specifies a value for a constraint. To include a list of values, separate the values using commas. To negate the values of a constraint, start the string with a ```!```.
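+
+For example, to spread a job's tasks across hosts while avoiding particular
+racks (the attribute names and values are illustrative):
+
+        constraints = {
+          'host': 'limit:1',           # at most one task per host
+          'rack': '!rack-a,!rack-b',   # avoid these rack attribute values
+        }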
+
+Further details can be found in the [Scheduling 
Constraints](../features/constraints) feature
+description.
+
+
+Template Namespaces
+===================
+
+Currently, a few Pystachio namespaces have special semantics. Using them
+in your configuration allow you to tailor application behavior
+through environment introspection or interact in special ways with the
+Aurora client or Aurora-provided services.
+
+### mesos Namespace
+
+The `mesos` namespace contains variables which relate to the `mesos` slave
+which launched the task. The `instance` variable can be used
+to distinguish between Task replicas.
+
+| variable name     | type       | description
+| --------------- | :--------: | -------------
+| ```instance```    | Integer    | The instance number of the created task. A 
job with 5 replicas has instance numbers 0, 1, 2, 3, and 4.
+| ```hostname``` | String | The instance hostname that the task was launched 
on.
+
+Please note, there is no uniqueness guarantee for `instance` in the presence of
+network partitions. If that is required, it should be baked in at the 
application
+level using a distributed coordination service such as Zookeeper.
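+
+As a sketch, a Process could use `{{mesos.instance}}` to give each replica a
+distinct argument (the command line is illustrative):
+
+        server = Process(
+          name = 'server',
+          cmdline = './server --shard={{mesos.instance}} --host={{mesos.hostname}}')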
+
+### thermos Namespace
+
+The `thermos` namespace contains variables that work directly on the
+Thermos platform in addition to Aurora. This namespace is fully
+compatible with Tasks invoked via the `thermos` CLI.
+
+| variable      | type                     | description
+| :----------:  | ---------                | ------------
+| ```ports```   | map of string to Integer | A map of names to port numbers
+| ```task_id``` | string                   | The task ID assigned to this task.
+
+The `thermos.ports` namespace is automatically populated by Aurora when
+invoking tasks on Mesos. When running the `thermos` command directly,
+these ports must be explicitly mapped with the `-P` option.
+
+For example, if `{{thermos.ports[http]}}` is specified in a `Process`
+configuration, it is automatically extracted and auto-populated by
+Aurora, but must be specified with, for example, `thermos -P http:12345`
+to map `http` to port 12345 when running via the CLI.
+

http://git-wip-us.apache.org/repos/asf/aurora/blob/f28f41a7/docs/reference/scheduler-configuration.md
----------------------------------------------------------------------
diff --git a/docs/reference/scheduler-configuration.md 
b/docs/reference/scheduler-configuration.md
new file mode 100644
index 0000000..0b1e3c7
--- /dev/null
+++ b/docs/reference/scheduler-configuration.md
@@ -0,0 +1,318 @@
+# Scheduler Configuration
+
+The Aurora scheduler can take a variety of configuration options through command-line arguments.
+A list of the available options can be seen by running `aurora-scheduler -help`.
+
+Please refer to the [Operator Configuration Guide](../operations/configuration.md) for details on how
+to properly set the most important options.
+
+```
+$ aurora-scheduler -help
+-------------------------------------------------------------------------
+-h or -help to print this help message
+
+Required flags:
+-backup_dir [not null]
+    Directory to store backups under. Will be created if it does not exist.
+    (org.apache.aurora.scheduler.storage.backup.BackupModule.backup_dir)
+-cluster_name [not null]
+    Name to identify the cluster being served.
+    (org.apache.aurora.scheduler.app.SchedulerMain.cluster_name)
+-framework_authentication_file
+    Properties file which contains framework credentials to authenticate with the Mesos master. Must contain the properties 'aurora_authentication_principal' and 'aurora_authentication_secret'.
+    (org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.framework_authentication_file)
+-mesos_master_address [not null]
+    Address for the mesos master, can be a socket address or zookeeper path.
+    
(org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.mesos_master_address)
+-mesos_role
+    The Mesos role this framework will register as. The default is to leave this empty; the framework will register without any role and only receive unreserved resources in offers.
+    (org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.mesos_role)
+-serverset_path [not null, must be non-empty]
+    ZooKeeper ServerSet path to register at.
+    (org.apache.aurora.scheduler.app.SchedulerMain.serverset_path)
+-shiro_after_auth_filter
+    Fully qualified class name of the servlet filter to be applied after the 
shiro auth filters are applied.
+    
(org.apache.aurora.scheduler.http.api.security.HttpSecurityModule.shiro_after_auth_filter)
+-thermos_executor_path
+    Path to the thermos executor entry point.
+    
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_executor_path)
+-tier_config [file must be readable]
+    Configuration file defining supported task tiers, task traits and 
behaviors.
+    (org.apache.aurora.scheduler.SchedulerModule.tier_config)
+-zk_digest_credentials
+    user:password to use when authenticating with ZooKeeper.
+    
(org.apache.aurora.scheduler.zookeeper.guice.client.flagged.FlaggedClientConfig.zk_digest_credentials)
+-zk_endpoints [must have at least 1 item]
+    Endpoint specification for the ZooKeeper servers.
+    
(org.apache.aurora.scheduler.zookeeper.guice.client.flagged.FlaggedClientConfig.zk_endpoints)
+
+Optional flags:
+-allow_docker_parameters=false
+    Allow docker container parameters to be passed in the job.
+    (org.apache.aurora.scheduler.app.AppModule.allow_docker_parameters)
+-allowed_container_types=[MESOS]
+    Container types that are allowed to be used by jobs.
+    (org.apache.aurora.scheduler.app.AppModule.allowed_container_types)
+-async_slot_stat_update_interval=(1, mins)
+    Interval on which to try to update open slot stats.
+    
(org.apache.aurora.scheduler.stats.AsyncStatsModule.async_slot_stat_update_interval)
+-async_task_stat_update_interval=(1, hrs)
+    Interval on which to try to update resource consumption stats.
+    
(org.apache.aurora.scheduler.stats.AsyncStatsModule.async_task_stat_update_interval)
+-async_worker_threads=8
+    The number of worker threads to process async task operations with.
+    (org.apache.aurora.scheduler.async.AsyncModule.async_worker_threads)
+-backup_interval=(1, hrs)
+    Minimum interval on which to write a storage backup.
+    (org.apache.aurora.scheduler.storage.backup.BackupModule.backup_interval)
+-cron_scheduler_num_threads=100
+    Number of threads to use for the cron scheduler thread pool.
+    
(org.apache.aurora.scheduler.cron.quartz.CronModule.cron_scheduler_num_threads)
+-cron_start_initial_backoff=(1, secs)
+    Initial backoff delay while waiting for a previous cron run to be killed.
+    
(org.apache.aurora.scheduler.cron.quartz.CronModule.cron_start_initial_backoff)
+-cron_start_max_backoff=(1, mins)
+    Max backoff delay while waiting for a previous cron run to be killed.
+    (org.apache.aurora.scheduler.cron.quartz.CronModule.cron_start_max_backoff)
+-cron_timezone=GMT
+    TimeZone to use for cron predictions.
+    (org.apache.aurora.scheduler.cron.quartz.CronModule.cron_timezone)
+-custom_executor_config [file must exist, file must be readable]
+    Path to custom executor settings configuration file.
+    
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.custom_executor_config)
+-db_lock_timeout=(1, mins)
+    H2 table lock timeout
+    (org.apache.aurora.scheduler.storage.db.DbModule.db_lock_timeout)
+-db_row_gc_interval=(2, hrs)
+    Interval on which to scan the database for unused row references.
+    (org.apache.aurora.scheduler.storage.db.DbModule.db_row_gc_interval)
+-default_docker_parameters={}
+    Default docker parameters for any job that does not explicitly declare 
parameters.
+    (org.apache.aurora.scheduler.app.AppModule.default_docker_parameters)
+-dlog_max_entry_size=(512, KB)
+    Specifies the maximum entry size to append to the log. Larger entries will 
be split across entry Frames.
+    
(org.apache.aurora.scheduler.storage.log.LogStorageModule.dlog_max_entry_size)
+-dlog_shutdown_grace_period=(2, secs)
+    Specifies the maximum time to wait for scheduled checkpoint and snapshot 
actions to complete before forcibly shutting down.
+    
(org.apache.aurora.scheduler.storage.log.LogStorageModule.dlog_shutdown_grace_period)
+-dlog_snapshot_interval=(1, hrs)
+    Specifies the frequency at which snapshots of local storage are taken and 
written to the log.
+    
(org.apache.aurora.scheduler.storage.log.LogStorageModule.dlog_snapshot_interval)
+-enable_cors_for
+    List of domains for which CORS support should be enabled.
+    (org.apache.aurora.scheduler.http.api.ApiModule.enable_cors_for)
+-enable_h2_console=false
+    Enable H2 DB management console.
+    (org.apache.aurora.scheduler.http.H2ConsoleModule.enable_h2_console)
+-enable_preemptor=true
+    Enable the preemptor and preemption
+    (org.apache.aurora.scheduler.preemptor.PreemptorModule.enable_preemptor)
+-executor_user=root
+    User to start the executor. Defaults to "root". Set this to an unprivileged user if the mesos master was started with "--no-root_submissions". If set to anything other than "root", the executor will ignore the "role" setting for jobs since it can't use setuid() anymore. This means that all your jobs will run under the specified user and the user has to exist on the mesos slaves.
+    (org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.executor_user)
+-first_schedule_delay=(1, ms)
+    Initial amount of time to wait before first attempting to schedule a 
PENDING task.
+    
(org.apache.aurora.scheduler.scheduling.SchedulingModule.first_schedule_delay)
+-flapping_task_threshold=(5, mins)
+    A task that repeatedly runs for less than this time is considered to be 
flapping.
+    
(org.apache.aurora.scheduler.scheduling.SchedulingModule.flapping_task_threshold)
+-framework_announce_principal=false
+    When the 'framework_authentication_file' flag is set, the FrameworkInfo registered with the mesos master will also contain the principal. This is necessary if you intend to use mesos authorization via mesos ACLs. The default will change in a future release.
+    (org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.framework_announce_principal)
+-framework_failover_timeout=(21, days)
+    Time after which a framework is considered deleted.  SHOULD BE VERY HIGH.
+    
(org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.framework_failover_timeout)
+-global_container_mounts=[]
+    A comma separated list of mount points (in host:container form) to mount into all (non-mesos) containers.
+    (org.apache.aurora.scheduler.configuration.executor.ExecutorModule.global_container_mounts)
+-history_max_per_job_threshold=100
+    Maximum number of terminated tasks to retain in a job history.
+    
(org.apache.aurora.scheduler.pruning.PruningModule.history_max_per_job_threshold)
+-history_min_retention_threshold=(1, hrs)
+    Minimum guaranteed time for task history retention before any pruning is 
attempted.
+    
(org.apache.aurora.scheduler.pruning.PruningModule.history_min_retention_threshold)
+-history_prune_threshold=(2, days)
+    Time after which the scheduler will prune terminated task history.
+    (org.apache.aurora.scheduler.pruning.PruningModule.history_prune_threshold)
+-hostname
+    The hostname to advertise in ZooKeeper instead of the locally-resolved 
hostname.
+    (org.apache.aurora.scheduler.http.JettyServerModule.hostname)
+-http_authentication_mechanism=NONE
+    HTTP Authentication mechanism to use.
+    
(org.apache.aurora.scheduler.http.api.security.HttpSecurityModule.http_authentication_mechanism)
+-http_port=0
+    The port to start an HTTP server on.  Default value will choose a random 
port.
+    (org.apache.aurora.scheduler.http.JettyServerModule.http_port)
+-initial_flapping_task_delay=(30, secs)
+    Initial amount of time to wait before attempting to schedule a flapping 
task.
+    
(org.apache.aurora.scheduler.scheduling.SchedulingModule.initial_flapping_task_delay)
+-initial_schedule_penalty=(1, secs)
+    Initial amount of time to wait before attempting to schedule a task that 
has failed to schedule.
+    
(org.apache.aurora.scheduler.scheduling.SchedulingModule.initial_schedule_penalty)
+-initial_task_kill_retry_interval=(5, secs)
+    When killing a task, retry after this delay if mesos has not responded, 
backing off up to transient_task_state_timeout
+    
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.initial_task_kill_retry_interval)
+-job_update_history_per_job_threshold=10
+    Maximum number of completed job updates to retain in a job update history.
+    
(org.apache.aurora.scheduler.pruning.PruningModule.job_update_history_per_job_threshold)
+-job_update_history_pruning_interval=(15, mins)
+    Job update history pruning interval.
+    
(org.apache.aurora.scheduler.pruning.PruningModule.job_update_history_pruning_interval)
+-job_update_history_pruning_threshold=(30, days)
+    Time after which the scheduler will prune completed job update history.
+    
(org.apache.aurora.scheduler.pruning.PruningModule.job_update_history_pruning_threshold)
+-kerberos_debug=false
+    Produce additional Kerberos debugging output.
+    
(org.apache.aurora.scheduler.http.api.security.Kerberos5ShiroRealmModule.kerberos_debug)
+-kerberos_server_keytab
+    Path to the server keytab.
+    
(org.apache.aurora.scheduler.http.api.security.Kerberos5ShiroRealmModule.kerberos_server_keytab)
+-kerberos_server_principal
+    Kerberos server principal to use, usually of the form 
HTTP/aurora.example....@example.com
+    
(org.apache.aurora.scheduler.http.api.security.Kerberos5ShiroRealmModule.kerberos_server_principal)
+-max_flapping_task_delay=(5, mins)
+    Maximum delay between attempts to schedule a flapping task.
+    
(org.apache.aurora.scheduler.scheduling.SchedulingModule.max_flapping_task_delay)
+-max_leading_duration=(1, days)
+    After leading for this duration, the scheduler should commit suicide.
+    (org.apache.aurora.scheduler.SchedulerModule.max_leading_duration)
+-max_registration_delay=(1, mins)
+    Max allowable delay to allow the driver to register before aborting
+    (org.apache.aurora.scheduler.SchedulerModule.max_registration_delay)
+-max_reschedule_task_delay_on_startup=(30, secs)
+    Upper bound of random delay for pending task rescheduling on scheduler 
startup.
+    
(org.apache.aurora.scheduler.scheduling.SchedulingModule.max_reschedule_task_delay_on_startup)
+-max_saved_backups=48
+    Maximum number of backups to retain before deleting the oldest backups.
+    (org.apache.aurora.scheduler.storage.backup.BackupModule.max_saved_backups)
+-max_schedule_attempts_per_sec=40.0
+    Maximum number of scheduling attempts to make per second.
+    
(org.apache.aurora.scheduler.scheduling.SchedulingModule.max_schedule_attempts_per_sec)
+-max_schedule_penalty=(1, mins)
+    Maximum delay between attempts to schedule a PENDING task.
+    (org.apache.aurora.scheduler.scheduling.SchedulingModule.max_schedule_penalty)
+-max_status_update_batch_size=1000 [must be > 0]
+    The maximum number of status updates that can be processed in a batch.
+    (org.apache.aurora.scheduler.SchedulerModule.max_status_update_batch_size)
+-max_tasks_per_job=4000 [must be > 0]
+    Maximum number of allowed tasks in a single job.
+    (org.apache.aurora.scheduler.app.AppModule.max_tasks_per_job)
+-max_update_instance_failures=20000 [must be > 0]
+    Upper limit on the number of failures allowed during a job update. This 
helps cap potentially unbounded entries into storage.
+    (org.apache.aurora.scheduler.app.AppModule.max_update_instance_failures)
+-min_offer_hold_time=(5, mins)
+    Minimum amount of time to hold a resource offer before declining.
+    (org.apache.aurora.scheduler.offers.OffersModule.min_offer_hold_time)
+-native_log_election_retries=20
+    The maximum number of attempts to obtain a new log writer.
+    
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_election_retries)
+-native_log_election_timeout=(15, secs)
+    The timeout for a single attempt to obtain a new log writer.
+    
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_election_timeout)
+-native_log_file_path
+    Path to a file to store the native log data in.  If the parent directory does not exist it will be created.
+    (org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_file_path)
+-native_log_quorum_size=1
+    The size of the quorum required for all log mutations.
+    
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_quorum_size)
+-native_log_read_timeout=(5, secs)
+    The timeout for doing log reads.
+    
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_read_timeout)
+-native_log_write_timeout=(3, secs)
+    The timeout for doing log appends and truncations.
+    
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_write_timeout)
+-native_log_zk_group_path
+    A zookeeper node for use by the native log to track the master coordinator.
+    
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_zk_group_path)
+-offer_hold_jitter_window=(1, mins)
+    Maximum amount of random jitter to add to the offer hold time window.
+    (org.apache.aurora.scheduler.offers.OffersModule.offer_hold_jitter_window)
+-offer_reservation_duration=(3, mins)
+    Time to reserve a slave's offers while trying to satisfy a task preempting 
another.
+    
(org.apache.aurora.scheduler.scheduling.SchedulingModule.offer_reservation_duration)
+-preemption_delay=(3, mins)
+    Time interval after which a pending task becomes eligible to preempt other 
tasks
+    (org.apache.aurora.scheduler.preemptor.PreemptorModule.preemption_delay)
+-preemption_slot_hold_time=(5, mins)
+    Time to hold a preemption slot found before it is discarded.
+    
(org.apache.aurora.scheduler.preemptor.PreemptorModule.preemption_slot_hold_time)
+-preemption_slot_search_interval=(1, mins)
+    Time interval between pending task preemption slot searches.
+    
(org.apache.aurora.scheduler.preemptor.PreemptorModule.preemption_slot_search_interval)
+-receive_revocable_resources=false
+    Allows receiving revocable resource offers from Mesos.
+    
(org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.receive_revocable_resources)
+-reconciliation_explicit_interval=(60, mins)
+    Interval on which scheduler will ask Mesos for status updates of all 
non-terminal tasks known to scheduler.
+    
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.reconciliation_explicit_interval)
+-reconciliation_implicit_interval=(60, mins)
+    Interval on which scheduler will ask Mesos for status updates of all 
non-terminal tasks known to Mesos.
+    
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.reconciliation_implicit_interval)
+-reconciliation_initial_delay=(1, mins)
+    Initial amount of time to delay task reconciliation after scheduler start 
up.
+    
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.reconciliation_initial_delay)
+-reconciliation_schedule_spread=(30, mins)
+    Difference between explicit and implicit reconciliation intervals intended 
to create a non-overlapping task reconciliation schedule.
+    
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.reconciliation_schedule_spread)
+-shiro_ini_path
+    Path to shiro.ini for authentication and authorization configuration.
+    
(org.apache.aurora.scheduler.http.api.security.IniShiroRealmModule.shiro_ini_path)
+-shiro_realm_modules=[org.apache.aurora.scheduler.app.MoreModules$1@30c15d8b]
+    Guice modules for configuring Shiro Realms.
+    
(org.apache.aurora.scheduler.http.api.security.HttpSecurityModule.shiro_realm_modules)
+-sla_non_prod_metrics=[]
+    Metric categories collected for non production tasks.
+    (org.apache.aurora.scheduler.sla.SlaModule.sla_non_prod_metrics)
+-sla_prod_metrics=[JOB_UPTIMES, PLATFORM_UPTIME, MEDIANS]
+    Metric categories collected for production tasks.
+    (org.apache.aurora.scheduler.sla.SlaModule.sla_prod_metrics)
+-sla_stat_refresh_interval=(1, mins)
+    The SLA stat refresh interval.
+    (org.apache.aurora.scheduler.sla.SlaModule.sla_stat_refresh_interval)
+-slow_query_log_threshold=(25, ms)
+    Log all queries that take at least this long to execute.
+    
(org.apache.aurora.scheduler.storage.mem.InMemStoresModule.slow_query_log_threshold)
+-slow_query_log_threshold=(25, ms)
+    Log all queries that take at least this long to execute.
+    (org.apache.aurora.scheduler.storage.db.DbModule.slow_query_log_threshold)
+-stat_retention_period=(1, hrs)
+    Time for a stat to be retained in memory before expiring.
+    (org.apache.aurora.scheduler.stats.StatsModule.stat_retention_period)
+-stat_sampling_interval=(1, secs)
+    Statistic value sampling interval.
+    (org.apache.aurora.scheduler.stats.StatsModule.stat_sampling_interval)
+-thermos_executor_cpu=0.25
+    The number of CPU cores to allocate for each instance of the executor.
+    
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_executor_cpu)
+-thermos_executor_flags
+    Extra arguments to be passed to the thermos executor
+    
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_executor_flags)
+-thermos_executor_ram=(128, MB)
+    The amount of RAM to allocate for each instance of the executor.
+    
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_executor_ram)
+-thermos_executor_resources=[]
+    A comma separated list of additional resources to copy into the sandbox. Note: if thermos_executor_path is not the thermos_executor.pex file itself, this must include it.
+    (org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_executor_resources)
+-thermos_observer_root=/var/run/thermos
+    Path to the thermos observer root (by default /var/run/thermos.)
+    
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_observer_root)
+-transient_task_state_timeout=(5, mins)
+    The amount of time after which to treat a task stuck in a transient state 
as LOST.
+    
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.transient_task_state_timeout)
+-use_beta_db_task_store=false
+    Whether to use the experimental database-backed task store.
+    (org.apache.aurora.scheduler.storage.db.DbModule.use_beta_db_task_store)
+-viz_job_url_prefix=
+    URL prefix for job container stats.
+    (org.apache.aurora.scheduler.app.SchedulerMain.viz_job_url_prefix)
+-zk_chroot_path
+    chroot path to use for the ZooKeeper connections
+    
(org.apache.aurora.scheduler.zookeeper.guice.client.flagged.FlaggedClientConfig.zk_chroot_path)
+-zk_in_proc=false
+    Launches an embedded zookeeper server for local testing, causing -zk_endpoints to be ignored if specified.
+    (org.apache.aurora.scheduler.zookeeper.guice.client.flagged.FlaggedClientConfig.zk_in_proc)
+-zk_session_timeout=(4, secs)
+    The ZooKeeper session timeout.
+    
(org.apache.aurora.scheduler.zookeeper.guice.client.flagged.FlaggedClientConfig.zk_session_timeout)
+-------------------------------------------------------------------------
+```
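As an illustration, a minimal invocation supplying only the required flags above might look like the following. All names, paths, and addresses here are placeholders; consult the Operator Configuration Guide for values appropriate to your cluster.

```shell
# Hypothetical minimal invocation; every value below is a placeholder.
aurora-scheduler \
  -cluster_name=example \
  -backup_dir=/var/lib/aurora/backups \
  -mesos_master_address=zk://zk1.example.com:2181/mesos \
  -serverset_path=/aurora/scheduler \
  -zk_endpoints=zk1.example.com:2181 \
  -thermos_executor_path=/usr/share/aurora/thermos_executor.pex \
  -tier_config=/etc/aurora/tiers.json
```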

http://git-wip-us.apache.org/repos/asf/aurora/blob/f28f41a7/docs/reference/task-lifecycle.md
----------------------------------------------------------------------
diff --git a/docs/reference/task-lifecycle.md b/docs/reference/task-lifecycle.md
new file mode 100644
index 0000000..1477364
--- /dev/null
+++ b/docs/reference/task-lifecycle.md
@@ -0,0 +1,146 @@
+# Task Lifecycle
+
+When Aurora reads a configuration file and finds a `Job` definition, it:
+
+1.  Evaluates the `Job` definition.
+2.  Splits the `Job` into its constituent `Task`s.
+3.  Sends those `Task`s to the scheduler.
+4.  The scheduler puts the `Task`s into `PENDING` state, starting each
+    `Task`'s life cycle.
+
+
+![Life of a task](../images/lifeofatask.png)
+
+Please note, a couple of task states described below are missing from
+this state diagram.
+
+
+## PENDING to RUNNING states
+
+When a `Task` is in the `PENDING` state, the scheduler constantly
+searches for machines satisfying that `Task`'s resource request
+requirements (RAM, disk space, CPU time) while maintaining configuration
+constraints such as "a `Task` must run on machines dedicated to a
+particular role" or attribute limit constraints such as "at most 2
+`Task`s from the same `Job` may run on each rack". When the scheduler
+finds a suitable match, it assigns the `Task` to a machine and puts the
+`Task` into the `ASSIGNED` state.
+
+From the `ASSIGNED` state, the scheduler sends an RPC to the slave
+machine containing `Task` configuration, which the slave uses to spawn
+an executor responsible for the `Task`'s lifecycle. When the scheduler
+receives an acknowledgment that the machine has accepted the `Task`,
+the `Task` goes into `STARTING` state.
+
+`STARTING` state initializes a `Task` sandbox. When the sandbox is fully
+initialized, Thermos begins to invoke `Process`es. Also, the slave
+machine sends an update to the scheduler that the `Task` is
+in `RUNNING` state.
+
+
+
+## RUNNING to terminal states
+
+There are various ways that an active `Task` can transition into a terminal
+state. By definition, it can never leave a terminal state. However, depending
+on the nature of the termination and the originating `Job` definition
+(e.g. `service`, `max_task_failures`), a replacement `Task` might be
+scheduled.
+
+### Natural Termination: FINISHED, FAILED
+
+A `RUNNING` `Task` can terminate without direct user interaction. For
+example, it may be a finite computation that finishes, even something as
+simple as `echo hello world.`, or it could be an exceptional condition in
+a long-lived service. If the `Task` is successful (its underlying
+processes have succeeded with exit status `0` or finished without
+reaching failure limits) it moves into `FINISHED` state. If it finished
+after reaching a set of failure limits, it goes into `FAILED` state.
+
+A terminated `Task` which is subject to rescheduling will be temporarily
+`THROTTLED` if it is considered to be flapping. A task is flapping if its
+previous invocation was terminated after less than 5 minutes (scheduler
+default). The time penalty a task has to remain in the `THROTTLED` state
+before it is eligible for rescheduling increases with each consecutive
+failure.
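The growth of that penalty can be sketched as follows. This is an illustration, not the scheduler's actual code: the 30-second and 5-minute bounds correspond to the scheduler's `-initial_flapping_task_delay` and `-max_flapping_task_delay` defaults, but the exponential growth curve is an assumption here.

```python
def throttle_delay_secs(consecutive_failures, initial=30, maximum=300):
    """Back-off penalty for a flapping task: grows with each consecutive
    failure and is capped at the configured maximum (illustrative only)."""
    return min(initial * 2 ** (consecutive_failures - 1), maximum)

print(throttle_delay_secs(1))  # first failure: 30 seconds
print(throttle_delay_secs(5))  # capped at the 5 minute maximum: 300
```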
+
+### Forceful Termination: KILLING, RESTARTING
+
+You can terminate a `Task` by issuing an `aurora job kill` command, which
+moves it into `KILLING` state. The scheduler then sends the slave a
+request to terminate the `Task`. If the scheduler receives a successful
+response, it moves the `Task` into `KILLED` state and never restarts it.
+
+If a `Task` is forced into the `RESTARTING` state via the `aurora job restart`
+command, the scheduler kills the underlying task but in parallel schedules
+an identical replacement for it.
+
+In any case, the responsible executor on the slave follows an escalation
+sequence when killing a running task:
+
+  1. If a `HttpLifecycleConfig` is not present, skip to (4).
+  2. Send a POST to the `graceful_shutdown_endpoint` and wait 5 seconds.
+  3. Send a POST to the `shutdown_endpoint` and wait 5 seconds.
+  4. Send SIGTERM (`kill`) and wait at most `finalization_wait` seconds.
+  5. Send SIGKILL (`kill -9`).
+
+If the executor notices that all `Process`es in a `Task` have aborted
+during this sequence, it will not proceed with subsequent steps.
+Note that graceful shutdown is best-effort, and due to the many
+inevitable realities of distributed systems, it may not be performed.
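The escalation above can be summarized as a function of whether the task declares an `HttpLifecycleConfig`. This Python sketch mirrors the sequence; the step strings are illustrative shorthand, not the executor's real implementation:

```python
def kill_escalation_steps(has_http_lifecycle_config):
    """Ordered escalation steps the executor walks through; in practice it
    stops early if all Processes in the Task abort first."""
    steps = []
    if has_http_lifecycle_config:
        steps.append("POST graceful_shutdown_endpoint, wait 5s")
        steps.append("POST shutdown_endpoint, wait 5s")
    steps.append("SIGTERM, wait up to finalization_wait")
    steps.append("SIGKILL")
    return steps

print(kill_escalation_steps(False))  # no HttpLifecycleConfig: signals only
```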
+
+### Unexpected Termination: LOST
+
+If a `Task` stays in a transient task state for too long (such as `ASSIGNED`
+or `STARTING`), the scheduler forces it into `LOST` state, creating a new
+`Task` in its place that's sent into `PENDING` state.
+
+In addition, if the Mesos core tells the scheduler that a slave has
+become unhealthy (or outright disappeared), the `Task`s assigned to that
+slave go into `LOST` state and new `Task`s are created in their place.
+From `PENDING` state, there is no guarantee a `Task` will be reassigned
+to the same machine unless job constraints explicitly force it there.
+
+### Giving Priority to Production Tasks: PREEMPTING
+
+Sometimes a `Task` needs to be interrupted, such as when a non-production
+`Task`'s resources are needed by a higher priority production `Task`. This
+type of interruption is called *preemption*. When this happens in
+Aurora, the non-production `Task` is killed and moved into
+the `PREEMPTING` state when both of the following are true:
+
+- The task being killed is a non-production task.
+- The other task is a `PENDING` production task that hasn't been
+  scheduled due to a lack of resources.
+
+The scheduler UI shows that the non-production task was preempted in favor of
+the production task. At some point, tasks in `PREEMPTING` move to `KILLED`.
+
+Note that non-production tasks consuming many resources are likely to be
+preempted in favor of production tasks.
+
+### Making Room for Maintenance: DRAINING
+
+Cluster operators can put a slave into maintenance mode. This transitions
+all `Task`s running on that slave into `DRAINING` and eventually to `KILLED`.
+Drained `Task`s will be restarted on other slaves for which no maintenance
+has been announced yet.
+
+
+
+## State Reconciliation
+
+Due to the many inevitable realities of distributed systems, there might
+be a mismatch of perceived and actual cluster state (e.g. a machine returns
+from a `netsplit` but the scheduler has already marked all its `Task`s as
+`LOST` and rescheduled them).
+
+Aurora regularly runs a state reconciliation process in order to detect
+and correct such issues (e.g. by killing the errant `RUNNING` tasks).
+By default, the proper detection of all failure scenarios and inconsistencies
+may take up to an hour.
+
+To emphasize this point: there is no uniqueness guarantee for a single
+instance of a job in the presence of network partitions. If the `Task`
+requires that, it should be baked in at the application level using a
+distributed coordination service such as ZooKeeper.

http://git-wip-us.apache.org/repos/asf/aurora/blob/f28f41a7/docs/resources.md
----------------------------------------------------------------------
diff --git a/docs/resources.md b/docs/resources.md
deleted file mode 100644
index 27a2678..0000000
--- a/docs/resources.md
+++ /dev/null
@@ -1,164 +0,0 @@
-Resources and Sizing
-=============================
-
-- [Introduction](#introduction)
-- [CPU Isolation](#cpu-isolation)
-- [CPU Sizing](#cpu-sizing)
-- [Memory Isolation](#memory-isolation)
-- [Memory Sizing](#memory-sizing)
-- [Disk Space](#disk-space)
-- [Disk Space Sizing](#disk-space-sizing)
-- [Other Resources](#other-resources)
-- [Resource Quota](#resource-quota)
-- [Task Preemption](#task-preemption)
-
-## Introduction
-
-Aurora is a multi-tenant system; a single software instance runs on a
-server, serving multiple clients/tenants. To share resources among
-tenants, it implements isolation of:
-
-* CPU
-* memory
-* disk space
-
-CPU is a soft limit, and handled differently from memory and disk space.
-Too low a CPU value results in throttling your application and
-slowing it down. Memory and disk space are both hard limits; when your
-application goes over these values, it's killed.
-
-Let's look at each resource type in more detail:
-
-## CPU Isolation
-
-Mesos uses a quota based CPU scheduler (the *Completely Fair Scheduler*)
-to provide consistent and predictable performance.  This is effectively
-a guarantee of resources -- you receive at least what you requested, but
-also no more than you've requested.
-
-The scheduler gives applications a CPU quota for every 100 ms interval.
-When an application uses its quota for an interval, it is throttled for
-the rest of the 100 ms. Usage resets for each interval and unused
-quota does not carry over.
-
-For example, an application specifying 4.0 CPU has access to 400 ms of
-CPU time every 100 ms. This CPU quota can be used in different ways,
-depending on the application and available resources. Consider the
-scenarios shown in this diagram.
-
-![CPU Availability](images/CPUavailability.png)
-
-* *Scenario A*: the application can use up to 4 cores continuously for
-every 100 ms interval. It is never throttled and starts processing
-new requests immediately.
-
-* *Scenario B* : the application uses up to 8 cores (depending on
-availability) but is throttled after 50 ms. The CPU quota resets at the
-start of each new 100 ms interval.
-
-* *Scenario C* : is like Scenario A, but there is a garbage collection
-event in the second interval that consumes all CPU quota. The
-application throttles for the remaining 75 ms of that interval and
-cannot service requests until the next interval. In this example, the
-garbage collection finished in one interval but, depending on how much
-garbage needs collecting, it may take more than one interval and further
-delay service of requests.
-
-*Technical Note*: Mesos considers logical cores, also known as
-hyperthreading or SMT cores, as the unit of CPU.
-
-## CPU Sizing
-
-To correctly size Aurora-run Mesos tasks, specify a per-shard CPU value
-that lets the task run at its desired performance when at peak load
-distributed across all shards. Include reserve capacity of at least 50%,
-possibly more, depending on how critical your service is (or how
-confident you are about your original estimate :-)), ideally by
-increasing the number of shards to also improve resiliency. When running
-your application, observe its CPU stats over time. If consistently at or
-near your quota during peak load, you should consider increasing either
-per-shard CPU or the number of shards.
-
-## Memory Isolation
-
-Mesos uses dedicated memory allocation. Your application always has
-access to the amount of memory specified in your configuration. The
-application's memory use is defined as the sum of the resident set size
-(RSS) of all processes in a shard. Each shard is considered
-independently.
-
-In other words, say you specified a memory size of 10GB. Each shard
-would receive 10GB of memory. If an individual shard's memory demands
-exceed 10GB, that shard is killed, but the other shards continue
-working.
-
-*Technical note*: Total memory size is not enforced at allocation time,
-so your application can request more than its allocation without getting
-an ENOMEM. However, it will be killed shortly after.
-
-## Memory Sizing
-
-Size for your application's peak requirement. Observe the per-instance
-memory statistics over time, as memory requirements can vary over
-different periods. Remember that if your application exceeds its memory
-value, it will be killed, so you should also add a safety margin of
-around 10-20%. If you have the ability to do so, you may also want to
-put alerts on the per-instance memory.
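
As a sketch of this rule (hypothetical numbers; the 15% margin is an assumed value inside the suggested 10-20% range):

```python
def per_shard_ram_mb(peak_rss_mb, margin=0.15):
    """RAM to request per shard: observed peak RSS plus a safety margin,
    since a shard that exceeds its request is killed."""
    return peak_rss_mb * (1 + margin)

# A shard peaking at 8000 MB RSS, with a 15% margin:
print(round(per_shard_ram_mb(8000)))  # 9200
```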
-
-## Disk Space
-
-Disk space used by your application is defined as the sum of the files'
-disk space in your application's directory, including the `stdout` and
-`stderr` logged from your application. Each shard is considered
-independently. You should use off-node storage for your application's
-data whenever possible.
-
-In other words, say you specified disk space size of 100MB. Each shard
-would receive 100MB of disk space. If an individual shard's disk space
-demands exceed 100MB, that shard is killed, but the other shards
-continue working.
-
-After your application finishes running, its allocated disk space is
-reclaimed. Thus, your job's final action should move any disk content
-that you want to keep, such as logs, to your home file system or other
-less transitory storage. Disk reclamation takes place an undefined
-period after the application finish time; until then, the disk contents
-are still available but you shouldn't count on them being so.
-
-*Technical note*: Disk space is not enforced at write time, so your
-application can write above its quota without getting an ENOSPC, but it
-will be killed shortly after. This is subject to change.
-
-## Disk Space Sizing
-
-Size for your application's peak requirement. Rotate and discard log
-files as needed to stay within your quota. When running a Java process,
-add the maximum size of the Java heap to your disk space requirement, in
-order to account for an out of memory error dumping the heap
-into the application's sandbox space.
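
Putting the rule together (a hypothetical helper; the heap term applies only to Java processes, and the 10% margin is an assumption, not from this guide):

```python
def disk_mb(peak_log_mb, peak_data_mb, java_heap_mb=0, margin=0.1):
    """Disk to request: peak log and data footprint, plus room for a
    possible out-of-memory heap dump when running a Java process."""
    return (peak_log_mb + peak_data_mb + java_heap_mb) * (1 + margin)

# 200 MB of rotated logs + 300 MB of data + a 500 MB Java heap:
print(round(disk_mb(200, 300, java_heap_mb=500)))  # 1100
```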
-
-## Other Resources
-
-Other resources, such as network bandwidth, do not have any performance
-guarantees. For some resources, such as memory bandwidth, there are no
-practical sharing methods so some application combinations collocated on
-the same host may cause contention.
-
-## Resource Quota
-
-Aurora requires resource quotas for
-[production non-dedicated jobs](configuration-reference.md#job-objects). Quota is enforced at
-the job role level and, when set, defines a non-preemptible pool of compute resources within
-that role.
-
-To grant quota to a particular role in production, use the `aurora_admin set_quota` command.
-
-NOTE: all job types (service, ad hoc, or cron) require role resource quota unless a job has a
-[dedicated constraint set](deploying-aurora-scheduler.md#dedicated-attribute).
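
As an illustration, granting a role 100 cores, 100 GB of RAM, and 200 GB of disk might look like the following (cluster and role names here are placeholders; check `aurora_admin set_quota -h` for the exact argument format in your version):

```
aurora_admin set_quota devcluster www-data 100 100GB 200GB
```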
-
-## Task Preemption
-
-Under resource shortage pressure, tasks from
-[production](configuration-reference.md#job-objects) jobs may preempt tasks from any
-non-production job. A production task may only be preempted by tasks from production jobs in
-the same role with higher [priority](configuration-reference.md#job-objects).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/aurora/blob/f28f41a7/docs/scheduler-configuration.md
----------------------------------------------------------------------
diff --git a/docs/scheduler-configuration.md b/docs/scheduler-configuration.md
deleted file mode 100644
index 7e3d801..0000000
--- a/docs/scheduler-configuration.md
+++ /dev/null
@@ -1,318 +0,0 @@
-# Scheduler Configuration
-
-The Aurora scheduler can take a variety of configuration options through command-line arguments.
-A list of the available options can be seen by running `aurora-scheduler -help`.
-
-Please refer to [Deploying the Aurora Scheduler](deploying-aurora-scheduler.md) for details on
-how to properly set the most important options.
-
-```
-$ aurora-scheduler -help
--------------------------------------------------------------------------
--h or -help to print this help message
-
-Required flags:
--backup_dir [not null]
-       Directory to store backups under. Will be created if it does not exist.
-       (org.apache.aurora.scheduler.storage.backup.BackupModule.backup_dir)
--cluster_name [not null]
-       Name to identify the cluster being served.
-       (org.apache.aurora.scheduler.app.SchedulerMain.cluster_name)
--framework_authentication_file
-       Properties file which contains framework credentials to authenticate 
with Mesosmaster. Must contain the properties 'aurora_authentication_principal' 
and 'aurora_authentication_secret'.
-       
(org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.framework_authentication_file)
--mesos_master_address [not null]
-       Address for the mesos master, can be a socket address or zookeeper path.
-       
(org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.mesos_master_address)
--mesos_role
-       The Mesos role this framework will register as. The default is to leave this empty, in which case the framework registers without any role and only receives unreserved resources in offers.
-       (org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.mesos_role)
--serverset_path [not null, must be non-empty]
-       ZooKeeper ServerSet path to register at.
-       (org.apache.aurora.scheduler.app.SchedulerMain.serverset_path)
--shiro_after_auth_filter
-       Fully qualified class name of the servlet filter to be applied after 
the shiro auth filters are applied.
-       
(org.apache.aurora.scheduler.http.api.security.HttpSecurityModule.shiro_after_auth_filter)
--thermos_executor_path
-       Path to the thermos executor entry point.
-       
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_executor_path)
--tier_config [file must be readable]
-       Configuration file defining supported task tiers, task traits and 
behaviors.
-       (org.apache.aurora.scheduler.SchedulerModule.tier_config)
--zk_digest_credentials
-       user:password to use when authenticating with ZooKeeper.
-       
(org.apache.aurora.scheduler.zookeeper.guice.client.flagged.FlaggedClientConfig.zk_digest_credentials)
--zk_endpoints [must have at least 1 item]
-       Endpoint specification for the ZooKeeper servers.
-       
(org.apache.aurora.scheduler.zookeeper.guice.client.flagged.FlaggedClientConfig.zk_endpoints)
-
-Optional flags:
--allow_docker_parameters=false
-       Allow to pass docker container parameters in the job.
-       (org.apache.aurora.scheduler.app.AppModule.allow_docker_parameters)
--allowed_container_types=[MESOS]
-       Container types that are allowed to be used by jobs.
-       (org.apache.aurora.scheduler.app.AppModule.allowed_container_types)
--async_slot_stat_update_interval=(1, mins)
-       Interval on which to try to update open slot stats.
-       
(org.apache.aurora.scheduler.stats.AsyncStatsModule.async_slot_stat_update_interval)
--async_task_stat_update_interval=(1, hrs)
-       Interval on which to try to update resource consumption stats.
-       
(org.apache.aurora.scheduler.stats.AsyncStatsModule.async_task_stat_update_interval)
--async_worker_threads=8
-       The number of worker threads to process async task operations with.
-       (org.apache.aurora.scheduler.async.AsyncModule.async_worker_threads)
--backup_interval=(1, hrs)
-       Minimum interval on which to write a storage backup.
-       
(org.apache.aurora.scheduler.storage.backup.BackupModule.backup_interval)
--cron_scheduler_num_threads=100
-       Number of threads to use for the cron scheduler thread pool.
-       
(org.apache.aurora.scheduler.cron.quartz.CronModule.cron_scheduler_num_threads)
--cron_start_initial_backoff=(1, secs)
-       Initial backoff delay while waiting for a previous cron run to be 
killed.
-       
(org.apache.aurora.scheduler.cron.quartz.CronModule.cron_start_initial_backoff)
--cron_start_max_backoff=(1, mins)
-       Max backoff delay while waiting for a previous cron run to be killed.
-       
(org.apache.aurora.scheduler.cron.quartz.CronModule.cron_start_max_backoff)
--cron_timezone=GMT
-       TimeZone to use for cron predictions.
-       (org.apache.aurora.scheduler.cron.quartz.CronModule.cron_timezone)
--custom_executor_config [file must exist, file must be readable]
-       Path to custom executor settings configuration file.
-       
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.custom_executor_config)
--db_lock_timeout=(1, mins)
-       H2 table lock timeout
-       (org.apache.aurora.scheduler.storage.db.DbModule.db_lock_timeout)
--db_row_gc_interval=(2, hrs)
-       Interval on which to scan the database for unused row references.
-       (org.apache.aurora.scheduler.storage.db.DbModule.db_row_gc_interval)
--default_docker_parameters={}
-       Default docker parameters for any job that does not explicitly declare 
parameters.
-       (org.apache.aurora.scheduler.app.AppModule.default_docker_parameters)
--dlog_max_entry_size=(512, KB)
-       Specifies the maximum entry size to append to the log. Larger entries 
will be split across entry Frames.
-       
(org.apache.aurora.scheduler.storage.log.LogStorageModule.dlog_max_entry_size)
--dlog_shutdown_grace_period=(2, secs)
-       Specifies the maximum time to wait for scheduled checkpoint and 
snapshot actions to complete before forcibly shutting down.
-       
(org.apache.aurora.scheduler.storage.log.LogStorageModule.dlog_shutdown_grace_period)
--dlog_snapshot_interval=(1, hrs)
-       Specifies the frequency at which snapshots of local storage are taken 
and written to the log.
-       
(org.apache.aurora.scheduler.storage.log.LogStorageModule.dlog_snapshot_interval)
--enable_cors_for
-       List of domains for which CORS support should be enabled.
-       (org.apache.aurora.scheduler.http.api.ApiModule.enable_cors_for)
--enable_h2_console=false
-       Enable H2 DB management console.
-       (org.apache.aurora.scheduler.http.H2ConsoleModule.enable_h2_console)
--enable_preemptor=true
-       Enable the preemptor and preemption
-       (org.apache.aurora.scheduler.preemptor.PreemptorModule.enable_preemptor)
--executor_user=root
-       User to start the executor. Defaults to "root". Set this to an 
unprivileged user if the mesos master was started with "--no-root_submissions". 
If set to anything other than "root", the executor will ignore the "role" 
setting for jobs since it can't use setuid() anymore. This means that all your 
jobs will run under the specified user and the user has to exist on the mesos 
slaves.
-       
(org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.executor_user)
--first_schedule_delay=(1, ms)
-       Initial amount of time to wait before first attempting to schedule a 
PENDING task.
-       
(org.apache.aurora.scheduler.scheduling.SchedulingModule.first_schedule_delay)
--flapping_task_threshold=(5, mins)
-       A task that repeatedly runs for less than this time is considered to be 
flapping.
-       
(org.apache.aurora.scheduler.scheduling.SchedulingModule.flapping_task_threshold)
--framework_announce_principal=false
-       When 'framework_authentication_file' flag is set, the FrameworkInfo 
registered with the mesos master will also contain the principal. This is 
necessary if you intend to use mesos authorization via mesos ACLs. The default 
will change in a future release.
-       
(org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.framework_announce_principal)
--framework_failover_timeout=(21, days)
-       Time after which a framework is considered deleted.  SHOULD BE VERY 
HIGH.
-       
(org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.framework_failover_timeout)
--global_container_mounts=[]
-       A comma-separated list of mount points (in host:container form) to mount into all (non-mesos) containers.
-       (org.apache.aurora.scheduler.configuration.executor.ExecutorModule.global_container_mounts)
--history_max_per_job_threshold=100
-       Maximum number of terminated tasks to retain in a job history.
-       
(org.apache.aurora.scheduler.pruning.PruningModule.history_max_per_job_threshold)
--history_min_retention_threshold=(1, hrs)
-       Minimum guaranteed time for task history retention before any pruning 
is attempted.
-       
(org.apache.aurora.scheduler.pruning.PruningModule.history_min_retention_threshold)
--history_prune_threshold=(2, days)
-       Time after which the scheduler will prune terminated task history.
-       
(org.apache.aurora.scheduler.pruning.PruningModule.history_prune_threshold)
--hostname
-       The hostname to advertise in ZooKeeper instead of the locally-resolved 
hostname.
-       (org.apache.aurora.scheduler.http.JettyServerModule.hostname)
--http_authentication_mechanism=NONE
-       HTTP Authentication mechanism to use.
-       
(org.apache.aurora.scheduler.http.api.security.HttpSecurityModule.http_authentication_mechanism)
--http_port=0
-       The port to start an HTTP server on.  Default value will choose a 
random port.
-       (org.apache.aurora.scheduler.http.JettyServerModule.http_port)
--initial_flapping_task_delay=(30, secs)
-       Initial amount of time to wait before attempting to schedule a flapping 
task.
-       
(org.apache.aurora.scheduler.scheduling.SchedulingModule.initial_flapping_task_delay)
--initial_schedule_penalty=(1, secs)
-       Initial amount of time to wait before attempting to schedule a task 
that has failed to schedule.
-       
(org.apache.aurora.scheduler.scheduling.SchedulingModule.initial_schedule_penalty)
--initial_task_kill_retry_interval=(5, secs)
-       When killing a task, retry after this delay if mesos has not responded, 
backing off up to transient_task_state_timeout
-       
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.initial_task_kill_retry_interval)
--job_update_history_per_job_threshold=10
-       Maximum number of completed job updates to retain in a job update 
history.
-       
(org.apache.aurora.scheduler.pruning.PruningModule.job_update_history_per_job_threshold)
--job_update_history_pruning_interval=(15, mins)
-       Job update history pruning interval.
-       
(org.apache.aurora.scheduler.pruning.PruningModule.job_update_history_pruning_interval)
--job_update_history_pruning_threshold=(30, days)
-       Time after which the scheduler will prune completed job update history.
-       
(org.apache.aurora.scheduler.pruning.PruningModule.job_update_history_pruning_threshold)
--kerberos_debug=false
-       Produce additional Kerberos debugging output.
-       
(org.apache.aurora.scheduler.http.api.security.Kerberos5ShiroRealmModule.kerberos_debug)
--kerberos_server_keytab
-       Path to the server keytab.
-       
(org.apache.aurora.scheduler.http.api.security.Kerberos5ShiroRealmModule.kerberos_server_keytab)
--kerberos_server_principal
-       Kerberos server principal to use, usually of the form 
HTTP/aurora.example....@example.com
-       
(org.apache.aurora.scheduler.http.api.security.Kerberos5ShiroRealmModule.kerberos_server_principal)
--max_flapping_task_delay=(5, mins)
-       Maximum delay between attempts to schedule a flapping task.
-       
(org.apache.aurora.scheduler.scheduling.SchedulingModule.max_flapping_task_delay)
--max_leading_duration=(1, days)
-       After leading for this duration, the scheduler should commit suicide.
-       (org.apache.aurora.scheduler.SchedulerModule.max_leading_duration)
--max_registration_delay=(1, mins)
-       Max allowable delay to allow the driver to register before aborting
-       (org.apache.aurora.scheduler.SchedulerModule.max_registration_delay)
--max_reschedule_task_delay_on_startup=(30, secs)
-       Upper bound of random delay for pending task rescheduling on scheduler 
startup.
-       
(org.apache.aurora.scheduler.scheduling.SchedulingModule.max_reschedule_task_delay_on_startup)
--max_saved_backups=48
-       Maximum number of backups to retain before deleting the oldest backups.
-       
(org.apache.aurora.scheduler.storage.backup.BackupModule.max_saved_backups)
--max_schedule_attempts_per_sec=40.0
-       Maximum number of scheduling attempts to make per second.
-       
(org.apache.aurora.scheduler.scheduling.SchedulingModule.max_schedule_attempts_per_sec)
--max_schedule_penalty=(1, mins)
-       Maximum delay between attempts to schedule PENDING tasks.
-       (org.apache.aurora.scheduler.scheduling.SchedulingModule.max_schedule_penalty)
--max_status_update_batch_size=1000 [must be > 0]
-       The maximum number of status updates that can be processed in a batch.
-       
(org.apache.aurora.scheduler.SchedulerModule.max_status_update_batch_size)
--max_tasks_per_job=4000 [must be > 0]
-       Maximum number of allowed tasks in a single job.
-       (org.apache.aurora.scheduler.app.AppModule.max_tasks_per_job)
--max_update_instance_failures=20000 [must be > 0]
-       Upper limit on the number of failures allowed during a job update. This 
helps cap potentially unbounded entries into storage.
-       (org.apache.aurora.scheduler.app.AppModule.max_update_instance_failures)
--min_offer_hold_time=(5, mins)
-       Minimum amount of time to hold a resource offer before declining.
-       (org.apache.aurora.scheduler.offers.OffersModule.min_offer_hold_time)
--native_log_election_retries=20
-       The maximum number of attempts to obtain a new log writer.
-       
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_election_retries)
--native_log_election_timeout=(15, secs)
-       The timeout for a single attempt to obtain a new log writer.
-       
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_election_timeout)
--native_log_file_path
-       Path to a file to store the native log data in.  If the parent directory does not exist, it will be created.
-       (org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_file_path)
--native_log_quorum_size=1
-       The size of the quorum required for all log mutations.
-       
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_quorum_size)
--native_log_read_timeout=(5, secs)
-       The timeout for doing log reads.
-       
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_read_timeout)
--native_log_write_timeout=(3, secs)
-       The timeout for doing log appends and truncations.
-       
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_write_timeout)
--native_log_zk_group_path
-       A zookeeper node for use by the native log to track the master 
coordinator.
-       
(org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule.native_log_zk_group_path)
--offer_hold_jitter_window=(1, mins)
-       Maximum amount of random jitter to add to the offer hold time window.
-       
(org.apache.aurora.scheduler.offers.OffersModule.offer_hold_jitter_window)
--offer_reservation_duration=(3, mins)
-       Time to reserve a slave's offers while trying to satisfy a task 
preempting another.
-       
(org.apache.aurora.scheduler.scheduling.SchedulingModule.offer_reservation_duration)
--preemption_delay=(3, mins)
-       Time interval after which a pending task becomes eligible to preempt 
other tasks
-       (org.apache.aurora.scheduler.preemptor.PreemptorModule.preemption_delay)
--preemption_slot_hold_time=(5, mins)
-       Time to hold a preemption slot found before it is discarded.
-       
(org.apache.aurora.scheduler.preemptor.PreemptorModule.preemption_slot_hold_time)
--preemption_slot_search_interval=(1, mins)
-       Time interval between pending task preemption slot searches.
-       
(org.apache.aurora.scheduler.preemptor.PreemptorModule.preemption_slot_search_interval)
--receive_revocable_resources=false
-       Allows receiving revocable resource offers from Mesos.
-       
(org.apache.aurora.scheduler.mesos.CommandLineDriverSettingsModule.receive_revocable_resources)
--reconciliation_explicit_interval=(60, mins)
-       Interval on which scheduler will ask Mesos for status updates of all 
non-terminal tasks known to scheduler.
-       
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.reconciliation_explicit_interval)
--reconciliation_implicit_interval=(60, mins)
-       Interval on which scheduler will ask Mesos for status updates of all 
non-terminal tasks known to Mesos.
-       
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.reconciliation_implicit_interval)
--reconciliation_initial_delay=(1, mins)
-       Initial amount of time to delay task reconciliation after scheduler 
start up.
-       
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.reconciliation_initial_delay)
--reconciliation_schedule_spread=(30, mins)
-       Difference between explicit and implicit reconciliation intervals 
intended to create a non-overlapping task reconciliation schedule.
-       
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.reconciliation_schedule_spread)
--shiro_ini_path
-       Path to shiro.ini for authentication and authorization configuration.
-       
(org.apache.aurora.scheduler.http.api.security.IniShiroRealmModule.shiro_ini_path)
--shiro_realm_modules=[org.apache.aurora.scheduler.app.MoreModules$1@30c15d8b]
-       Guice modules for configuring Shiro Realms.
-       
(org.apache.aurora.scheduler.http.api.security.HttpSecurityModule.shiro_realm_modules)
--sla_non_prod_metrics=[]
-       Metric categories collected for non production tasks.
-       (org.apache.aurora.scheduler.sla.SlaModule.sla_non_prod_metrics)
--sla_prod_metrics=[JOB_UPTIMES, PLATFORM_UPTIME, MEDIANS]
-       Metric categories collected for production tasks.
-       (org.apache.aurora.scheduler.sla.SlaModule.sla_prod_metrics)
--sla_stat_refresh_interval=(1, mins)
-       The SLA stat refresh interval.
-       (org.apache.aurora.scheduler.sla.SlaModule.sla_stat_refresh_interval)
--slow_query_log_threshold=(25, ms)
-       Log all queries that take at least this long to execute.
-       
(org.apache.aurora.scheduler.storage.mem.InMemStoresModule.slow_query_log_threshold)
--slow_query_log_threshold=(25, ms)
-       Log all queries that take at least this long to execute.
-       
(org.apache.aurora.scheduler.storage.db.DbModule.slow_query_log_threshold)
--stat_retention_period=(1, hrs)
-       Time for a stat to be retained in memory before expiring.
-       (org.apache.aurora.scheduler.stats.StatsModule.stat_retention_period)
--stat_sampling_interval=(1, secs)
-       Statistic value sampling interval.
-       (org.apache.aurora.scheduler.stats.StatsModule.stat_sampling_interval)
--thermos_executor_cpu=0.25
-       The number of CPU cores to allocate for each instance of the executor.
-       
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_executor_cpu)
--thermos_executor_flags
-       Extra arguments to be passed to the thermos executor
-       
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_executor_flags)
--thermos_executor_ram=(128, MB)
-       The amount of RAM to allocate for each instance of the executor.
-       
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_executor_ram)
--thermos_executor_resources=[]
-       A comma-separated list of additional resources to copy into the sandbox. Note: if thermos_executor_path is not the thermos_executor.pex file itself, this must include it.
-       (org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_executor_resources)
--thermos_observer_root=/var/run/thermos
-       Path to the thermos observer root (by default /var/run/thermos.)
-       
(org.apache.aurora.scheduler.configuration.executor.ExecutorModule.thermos_observer_root)
--transient_task_state_timeout=(5, mins)
-       The amount of time after which to treat a task stuck in a transient 
state as LOST.
-       
(org.apache.aurora.scheduler.reconciliation.ReconciliationModule.transient_task_state_timeout)
--use_beta_db_task_store=false
-       Whether to use the experimental database-backed task store.
-       (org.apache.aurora.scheduler.storage.db.DbModule.use_beta_db_task_store)
--viz_job_url_prefix=
-       URL prefix for job container stats.
-       (org.apache.aurora.scheduler.app.SchedulerMain.viz_job_url_prefix)
--zk_chroot_path
-       chroot path to use for the ZooKeeper connections
-       
(org.apache.aurora.scheduler.zookeeper.guice.client.flagged.FlaggedClientConfig.zk_chroot_path)
--zk_in_proc=false
-       Launches an embedded zookeeper server for local testing causing 
-zk_endpoints to be ignored if specified.
-       
(org.apache.aurora.scheduler.zookeeper.guice.client.flagged.FlaggedClientConfig.zk_in_proc)
--zk_session_timeout=(4, secs)
-       The ZooKeeper session timeout.
-       
(org.apache.aurora.scheduler.zookeeper.guice.client.flagged.FlaggedClientConfig.zk_session_timeout)
--------------------------------------------------------------------------
-```
