Nick,
This feels like a nicely shaped API. See my inline comments below.
--
Mike.
On Wed, 23 Jan 2019, at 19:09, Nick Vatamaniuc wrote:
> ## API Spec
>
> * `GET /_shard_splits`
>
> Get a summary of shard splitting for the whole cluster. This would return
> the total number of shard splitting jobs and the number of active ones,
> that is the ones that are doing work at that very moment. Another piece of
> information is the global state of shard splitting, if it is stopped or
> running.
>
> {
> "jobs_total": 10,
> "jobs_running": 2,
> "states": {
> "running": [
> "[email protected]",
> "[email protected]",
> "[email protected]"
> ]
> }
> }
In the DELETE section, you note that completed jobs stick around (so you can
see the results and so on).
Given the different states available, should the status be something like:
"jobs": {
"total": 20,
"waiting": 10,
"running": 5,
"completed": 5
}
(Modulo question below about whether a queue is involved).
There's also a question in my head as to whether all the waiting/running go
into a "stopped" state when shard splitting is stopped on a cluster. I would
suggest yes, in which case the above might move to:
"jobs": {
"total": 20,
"stopped": 15,
"completed": 5
}
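To make the two shapes above concrete, here is a rough Python sketch of how the summary could be derived from the per-job states. The function name, the `cluster_stopped` flag and the set of "active" states are my own placeholders, not anything from the proposal, and it assumes a global stop really does move every waiting/running job to "stopped".

```python
from collections import Counter

# Hypothetical: the states that a global stop would collapse into "stopped".
ACTIVE_STATES = {"waiting", "running"}


def summarize_jobs(job_states, cluster_stopped=False):
    """Build a {"total": N, <state>: count, ...} summary from a list of job states."""
    if cluster_stopped:
        # Assumption: stopping the cluster moves waiting/running jobs to "stopped".
        job_states = ["stopped" if s in ACTIVE_STATES else s for s in job_states]
    summary = {"total": len(job_states)}
    summary.update(Counter(job_states))
    return summary


states = ["waiting"] * 10 + ["running"] * 5 + ["completed"] * 5
print(summarize_jobs(states))
# {'total': 20, 'waiting': 10, 'running': 5, 'completed': 5}
print(summarize_jobs(states, cluster_stopped=True))
# {'total': 20, 'stopped': 15, 'completed': 5}
```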
> * `PUT /_shard_splits`
>
> Enable or disable shard splitting on the cluster. This feature that would
...
>
> An alternative for this would be to have another underscore path like
> `_shard_splits/_state` but I feel it is better to minimize the use of
> underscore paths, as they feel less REST-ful.
`_shard_splits/state` is more REST-like, I think, since I'd say state is a
sub-resource. I'm not sure why this would need to be `_state`.
I would likely expose it as more of a state machine, with the POST requests
having the same schema regardless of the state transition requested:
{ "state": "stopped/started", "reason": "maintenance ticket 1234" }
My assumption is that this is an async call and you'd need to `GET
/_shard_splits/state` to check whether the transition has completed (i.e., all
jobs are in the stopped state). Having the sub-resource also gives you somewhere
to hang this `GET` call.
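A toy in-memory model of what I have in mind, in Python, purely to illustrate the async POST-then-GET flow. Everything here is hypothetical: the class, the `transition_complete` field and the `stop_one_job` helper (which stands in for whatever background work actually stops jobs) are not part of the proposal.

```python
class ShardSplitState:
    """Toy model of a hypothetical /_shard_splits/state sub-resource."""

    def __init__(self, job_states):
        self.requested = "started"
        self.reason = None
        self.job_states = list(job_states)

    def post(self, body):
        # Same schema for every transition: {"state": ..., "reason": ...}
        if body["state"] not in ("started", "stopped"):
            raise ValueError("unknown state: %r" % (body["state"],))
        self.requested = body["state"]
        self.reason = body.get("reason")
        return {"ok": True}  # accepted; jobs are stopped asynchronously

    def stop_one_job(self):
        # Stand-in for background work: stop the first still-active job, if any.
        for i, state in enumerate(self.job_states):
            if state in ("waiting", "running"):
                self.job_states[i] = "stopped"
                return

    def get(self):
        if self.requested == "stopped":
            # Done once no job is still waiting or running.
            done = all(s in ("stopped", "completed") for s in self.job_states)
        else:
            done = "stopped" not in self.job_states
        return {
            "state": self.requested,
            "reason": self.reason,
            "transition_complete": done,  # hypothetical field name
        }


state = ShardSplitState(["running", "waiting", "completed"])
state.post({"state": "stopped", "reason": "maintenance ticket 1234"})
print(state.get()["transition_complete"])  # False: two jobs are still active
state.stop_one_job()
state.stop_one_job()
print(state.get()["transition_complete"])  # True: every job is stopped or completed
```

A client would poll the `GET` until the transition is reported complete rather than assume the POST took effect immediately.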
> * `GET /_shard_splits/jobs`
>
> Get all shard split jobs
>
> Response body:
>
> {
> "jobs": [
> {
> "id":
> "001-e41e8751873b56e4beafa373823604d26a2f11ba434a040f865b48df835ccb0b",
> "job_state": "completed",
...
> }
> ],
> "offset": 0,
> "total_rows": 1
> }
My main question here is whether you need all the pieces of information in
this response. For example, is just the job_state enough?
> * `POST /_shard_splits/jobs`
>
> Start a shard splitting job.
>
> Request body:
>
> {
> "node": "[email protected]",
> "shard": "shards/00000000-FFFFFFFF/username/dbname.$timestamp"
> }
How would you see this looking if we wanted to split all replicas of a shard?
It'd be nice to avoid having to pass in an array of nodes for this if possible.
>
> Response body:
>
> {
> "id":
> "001-e41e8751873b56e4beafa373823604d26a2f11ba434a040f865b48df835ccb0b",
> "ok": true
> }
>
> Or if there are too many shard splitting jobs (a limit inspired by
> scheduling replicator as well) it might return an error:
>
> {
> "error": "max_jobs_exceeded",
> "reason": "There are $N jobs currently running"
> }
Is there a queue in play here, or do jobs either go straight to running or get
rejected?
> If shard splitting is disabled globally, users get an error and a reason.
> The reason here would be the reason sent in the `PUT /_shard_splits` body.
>
> {
> "error": "stopped",
> "reason": "Shard splitting is disabled on the cluster currently"
> }
👍
> * `GET /_shard_splits/jobs/$jobid`
My thought for this one was that it'd be more useful to the user to enumerate
the transition states and times the job has been through rather than have a
single created/started/updated -- given that there's an internal state machine.
{
"id":
"001-5f553fd2d9180c74aa39c35377fe3e1731d09ec39bbd0f02541f55148e48d888",
"job_state": "completed",
"node": "[email protected]",
"source": "shards/00000000-1fffffff/db.1548186810",
"split_state": "completed",
"state_info": {},
"targets": [
"shards/00000000-0fffffff/db.1548186810",
"shards/10000000-1fffffff/db.1548186810"
],
"transitions": [
{"state": "created", "ts": "2019-01-23T18:36:17.951228Z"},
{"state": "running", "ts": "2019-01-23T18:36:18.457231Z"},
{"state": "stopped", "ts": "2019-01-23T18:49:19.174453Z"},
{"state": "running", "ts": "2019-01-23T19:49:19.174453Z"},
{"state": "completed", "ts": "2019-01-23T19:55:19.174453Z"}
]
}
ts being the time the state was entered (and by definition the time the
previous state was exited).
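One thing a transitions list buys the user is that time spent in each state falls out directly, which a single created/started/updated triple can't give you once a job has been stopped and resumed. A minimal Python sketch, assuming the list is ordered and `ts` uses the UTC format shown above (the function name is just a placeholder):

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumes UTC timestamps as in the example


def time_in_states(transitions):
    """Return total seconds spent in each state that has been exited."""
    stamps = [datetime.strptime(t["ts"], TS_FORMAT) for t in transitions]
    totals = {}
    # Each state lasts from its own ts until the ts of the next transition;
    # the final state has no exit time yet, so it is not counted.
    for entry, start, end in zip(transitions, stamps, stamps[1:]):
        elapsed = (end - start).total_seconds()
        totals[entry["state"]] = totals.get(entry["state"], 0.0) + elapsed
    return totals


transitions = [
    {"state": "created", "ts": "2019-01-23T18:36:17.951228Z"},
    {"state": "running", "ts": "2019-01-23T18:36:18.457231Z"},
    {"state": "stopped", "ts": "2019-01-23T18:49:19.174453Z"},
    {"state": "running", "ts": "2019-01-23T19:49:19.174453Z"},
    {"state": "completed", "ts": "2019-01-23T19:55:19.174453Z"},
]
for state, seconds in time_in_states(transitions).items():
    print(state, round(seconds, 3))
```

For the example job this shows roughly 19 minutes of actual running either side of an hour spent stopped.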
> * `DELETE /_shard_splits/jobs/$jobid`
>
> Remove a job. After a job completes or fails, it will not be automatically
> removed but will stay around to allow the user to retrieve its status.
> After its status is inspected the user should use the DELETE method to
> remove the job. If the job is running, it will be cancelled and removed
> from the system.
>
> Response body:
>
> {
> "ok": true
> }
>
Is there a way to stop/cancel a splitting job? `POST
/_shard_splits/jobs/$jobid/state`? I guess DELETE could handle this but I could
imagine wanting to see that a job was stopped rather than it just disappearing.
Mike.