Nick,

Feels like a nicely shaped API. See inline comments.

-- 
Mike.

On Wed, 23 Jan 2019, at 19:09, Nick Vatamaniuc wrote:
> ## API Spec
> 
> * `GET /_shard_splits`
> 
> Get a summary of shard splitting for the whole cluster. This would return
> the total number of shard splitting jobs and the number of active ones,
> that is, the ones that are doing work at that very moment. Another piece of
> information is the global state of shard splitting, whether it is stopped or
> running.
> 
> {
>     "jobs_total": 10,
>     "jobs_running": 2,
>     "states": {
>         "running": [
>             "node1@127.0.0.1",
>             "node2@127.0.0.1",
>             "node3@127.0.0.1"
>         ]
>     }
> }

In the DELETE section, you note that completed jobs stick around (so you can 
see the results and so on).

Given the different states available, should the status be something like:

jobs: {
  total: 20,
  waiting: 10,
  running: 5,
  completed: 5
}

(Modulo question below about whether a queue is involved).

There's also a question in my head as to whether all the waiting/running jobs 
go into a "stopped" state when shard splitting is stopped on a cluster. I would 
suggest yes, in which case the above might move to:

jobs: {
  total: 20,
  stopped: 15,
  completed: 5
}

> * `PUT /_shard_splits`
> 
> Enable or disable shard splitting on the cluster. This feature that would
...
> 
> An alternative for this would be to have another underscore path like
> `_shard_splits/_state` but I feel it is better to minimize the use of
> underscore path, they feel less REST-ful.

`_shard_splits/state` is definitely more REST-like I think, as I'd say state is 
a sub-resource. I'm not sure why this would need to be `_state`.

I would likely expose it as more of a state machine, with the POST requests 
having matching schema regardless of the state transition requested:

{ "state": "stopped/started", "reason": "maintenance ticket 1234" }

My assumption is that this is an async call and you'd need to `GET 
/_shard_splits/state` to get whether the transition was completed (i.e., all 
jobs in stopped state). Having the sub-resource also gives you the opportunity 
for this `GET` call.
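
As a sketch of what I mean (field names here are just placeholders, not a 
schema proposal), the GET might return something like:

{
    "state": "stopped",
    "reason": "maintenance ticket 1234",
    "transition_complete": false
}

where "transition_complete" would only flip to true once every waiting/running 
job has actually reached the stopped state.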

> * `GET /_shard_splits/jobs`
> 
> Get all shard split jobs
> 
> Response body:
> 
> {
>     "jobs": [
>         {
>             "id":
> "001-e41e8751873b56e4beafa373823604d26a2f11ba434a040f865b48df835ccb0b",
>             "job_state": "completed",
...
>         }
>     ],
>     "offset": 0,
>     "total_rows": 1
> }

My main question here was whether you needed all the pieces of information in 
this response -- is, for example, just the job_state enough?
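
For illustration, a trimmed listing reusing only the id and job_state fields 
from your example might look like:

{
    "jobs": [
        {
            "id": "001-e41e8751873b56e4beafa373823604d26a2f11ba434a040f865b48df835ccb0b",
            "job_state": "completed"
        }
    ],
    "offset": 0,
    "total_rows": 1
}

with the full detail left to `GET /_shard_splits/jobs/$jobid`.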

> * `POST /_shard_splits/jobs`
> 
> Start a shard splitting job.
> 
> Request body:
> 
> {
>     "node": "dbc...@db1.sandbox001.cloudant.net",
>     "shard": "shards/00000000-FFFFFFFF/username/dbname.$timestamp"
> }

How would you see this looking if we wanted to split all replicas of a shard? 
It'd be nice to avoid having to pass in an array of nodes for this if possible.
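
Purely as an illustration, one shape could be to make "node" optional and treat 
its absence as "split every copy of this range":

{
    "shard": "shards/00000000-FFFFFFFF/username/dbname.$timestamp"
}

(the response would then presumably need to return one job id per replica).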

> 
> Response body:
> 
> {
>     "id":
> "001-e41e8751873b56e4beafa373823604d26a2f11ba434a040f865b48df835ccb0b",
>     "ok": true
> }
> 
> Or if there are too many shard splitting jobs (a limit inspired by
> scheduling replicator as well) it might return an error:
> 
> {
>     "error": "max_jobs_exceeded",
>     "reason" "There are $N jobs currently running"
> }

Is there a queue in play here, or do jobs either go straight to running or get 
rejected?
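
If there is a queue, I'd expect a submitted job to simply show up as waiting in 
the listing, e.g. (hypothetical, reusing the job_state field):

{
    "id": "001-e41e8751873b56e4beafa373823604d26a2f11ba434a040f865b48df835ccb0b",
    "job_state": "waiting"
}

rather than the POST being rejected with max_jobs_exceeded.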

> If shard splitting is disabled globally, the user gets an error and a reason.
> The reason here would be the reason sent in the `PUT /_shard_splits` body.
> 
> {
>     "error": "stopped",
>     "reason": "Shard splitting is disabled on the cluster currently"
> }

👍

> * `GET /_shard_splits/jobs/$jobid`

My thought for this one was that it'd be more useful to the user to enumerate 
the transition states and times the job has been through rather than have a 
single created/started/updated -- given that there's an internal state machine.

{
    "id":
"001-5f553fd2d9180c74aa39c35377fe3e1731d09ec39bbd0f02541f55148e48d888",
    "job_state": "completed",
    "node": "node1@127.0.0.1",
    "source": "shards/00000000-1fffffff/db.1548186810",
    "split_state": "completed",
    "state_info": {},
    "targets": [
        "shards/00000000-0fffffff/db.1548186810",
        "shards/10000000-1fffffff/db.1548186810"
    ],
    "transitions": [
        {"state": "created", ts: "2019-01-23T18:36:17.951228Z"},
        {"state": "running", ts: "2019-01-23T18:36:18.457231Z"},
        {"state": "stopped", ts: "2019-01-23T18:49:19.174453Z"},
        {"state": "running", ts: "2019-01-23T19:49:19.174453Z"},
        {"state": "completed", ts: "2019-01-23T19:55:19.174453Z"}
  ]
}

ts being the time the state was entered (and by definition the time the 
previous state was exited).

> * `DELETE /_shard_splits/jobs/$jobid`
> 
> Remove a job. After a job completes or fails, it will not be automatically
> removed but will stay around to allow the user to retrieve its status.
> After its status is inspected the user should use the DELETE method to
> remove the job. If the job is running, it will be cancelled and removed
> from the system.
> 
> Response body:
> 
> {
>     "ok": true
> }
>

Is there a way to stop/cancel a splitting job? `POST 
/_shard_splits/jobs/$jobid/state`? I guess DELETE could handle this but I could 
imagine wanting to see that a job was stopped rather than it just disappearing.
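
If a per-job state resource existed, its body could mirror the cluster-level 
one, e.g. (again just a sketch):

{ "state": "stopped", "reason": "maintenance ticket 1234" }

and the transitions list above would then record the stop rather than the job 
vanishing from the listing.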

Mike.
