Nick,

Feels like a nicely shaped API. See inline comments.
-- 
Mike.

On Wed, 23 Jan 2019, at 19:09, Nick Vatamaniuc wrote:
> ## API Spec
>
> * `GET /_shard_splits`
>
> Get a summary of shard splitting for the whole cluster. This would return
> the total number of shard splitting jobs and the number of active ones,
> that is the ones that are doing work at that very moment. Another piece of
> information is the global state of shard splitting, if it is stopped or
> running.
>
>     {
>         "jobs_total": 10,
>         "jobs_running": 2,
>         "states": {
>             "running": [
>                 "node1@127.0.0.1",
>                 "node2@127.0.0.1",
>                 "node3@127.0.0.1"
>             ]
>         }
>     }

In the DELETE section, you note that completed jobs stick around (so you can see the results and so on). Given the different states available, should the status be something like:

    jobs: {
        total: 20,
        waiting: 10,
        running: 5,
        completed: 5
    }

(Modulo question below about whether a queue is involved).

There's also a question in my head as to whether all the waiting/running go into a "stopped" state when shard splitting is stopped on a cluster. I would suggest yes, in which case the above might move to:

    jobs: {
        total: 20,
        stopped: 15,
        completed: 5
    }

> * `PUT /_shard_splits`
>
> Enable or disable shard splitting on the cluster. This feature that would ...
>
> An alternative for this would be to have another underscore path like
> `_shard_splits/_state` but I feel it is better to minimize the use of
> underscore path, they feel less REST-ful.

`_shard_splits/state` is definitely more REST-like I think, as I'd say state is a sub-resource. I'm not sure why this would need to be `_state`.

I would likely expose it as more of a state machine, with the POST requests having matching schema regardless of the state transition requested:

    {
        "state": "stopped/started",
        "reason": "maintenance ticket 1234"
    }

My assumption is that this is an async call and you'd need to `GET /_shard_splits/state` to get whether the transition was completed (i.e., all jobs in stopped state).
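
As a rough sketch of that flow in Python: request a transition, then poll until it has taken effect. Everything here is an assumption rather than a spec: the `{"state": ...}` response body for `GET /_shard_splits/state`, and the `post_state` / `get_state` callables, which are hypothetical stand-ins for whatever HTTP client you use.

```python
import time


def request_transition(post_state, get_state, target, reason,
                       timeout=30.0, interval=1.0, sleep=time.sleep):
    """Ask for a state transition, then poll until it completes.

    post_state(body) and get_state() are hypothetical stand-ins for
    POST and GET on /_shard_splits/state; get_state() is assumed to
    return a dict such as {"state": "stopped"}.
    Returns True once the target state is reported, False on timeout.
    """
    # Same request schema whichever transition is asked for.
    post_state({"state": target, "reason": reason})
    waited = 0.0
    while True:
        if get_state().get("state") == target:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Usage with an in-memory fake that needs two polls before it settles.
posted = []
replies = iter([{"state": "running"}, {"state": "running"}, {"state": "stopped"}])
done = request_transition(posted.append, lambda: next(replies),
                          "stopped", "maintenance ticket 1234",
                          sleep=lambda seconds: None)
```

Passing the two calls in as functions keeps the sketch independent of any particular HTTP library and of the final shape of the endpoint.
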
Having the sub-resource also gives you the opportunity for this `GET` call.

> * `GET /_shard_splits/jobs`
>
> Get all shard split jobs
>
> Response body:
>
>     {
>         "jobs": [
>             {
>                 "id": "001-e41e8751873b56e4beafa373823604d26a2f11ba434a040f865b48df835ccb0b",
>                 "job_state": "completed",
>                 ...
>             }
>         ],
>         "offset": 0,
>         "total_rows": 1
>     }

My main question here was whether you needed all the pieces of information in this response -- is for example just the job_state enough?

> * `POST /_shard_splits/jobs`
>
> Start a shard splitting job.
>
> Request body:
>
>     {
>         "node": "dbc...@db1.sandbox001.cloudant.net",
>         "shard": "shards/00000000-FFFFFFFF/username/dbname.$timestamp"
>     }

How would you see this looking if we wanted to split all replicas of a shard? It'd be nice to avoid having to pass in an array of nodes for this if possible.

> Response body:
>
>     {
>         "id": "001-e41e8751873b56e4beafa373823604d26a2f11ba434a040f865b48df835ccb0b",
>         "ok": true
>     }
>
> Or if there are too many shard splitting jobs (a limit inspired by
> scheduling replicator as well) it might return an error:
>
>     {
>         "error": "max_jobs_exceeded",
>         "reason": "There are $N jobs currently running"
>     }

Is there a queue in play here, or do jobs either go straight to running or get rejected?

> If shard splitting is disabled globally, user get an error and a reason.
> The reason here would be the reason sent in the `PUT /_shard_splits` body.
>
>     {
>         "error": "stopped",
>         "reason": "Shard splitting is disabled on the cluster currently"
>     }

👍

> * `GET /_shard_splits/jobs/$jobid`

My thought for this one was that it'd be more useful to the user to enumerate the transition states and times the job has been through rather than have a single created/started/updated -- given that there's an internal state machine.

    {
        "id": "001-5f553fd2d9180c74aa39c35377fe3e1731d09ec39bbd0f02541f55148e48d888",
        "job_state": "completed",
        "node": "node1@127.0.0.1",
        "source": "shards/00000000-1fffffff/db.1548186810",
        "split_state": "completed",
        "state_info": {},
        "targets": [
            "shards/00000000-0fffffff/db.1548186810",
            "shards/10000000-1fffffff/db.1548186810"
        ],
        "transitions": [
            {"state": "created", "ts": "2019-01-23T18:36:17.951228Z"},
            {"state": "running", "ts": "2019-01-23T18:36:18.457231Z"},
            {"state": "stopped", "ts": "2019-01-23T18:49:19.174453Z"},
            {"state": "running", "ts": "2019-01-23T19:49:19.174453Z"},
            {"state": "completed", "ts": "2019-01-23T19:55:19.174453Z"}
        ]
    }

`ts` being the time the state was entered (and by definition the time the previous state was exited).

> * `DELETE /_shard_splits/jobs/$jobid`
>
> Remove a job. After a job completes or fails, it will not be automatically
> removed but will stay around to allow the user to retrieve its status.
> After its status is inspected the user should use the DELETE method to
> remove the job. If the job is running, it will be cancelled and removed
> from the system.
>
> Response body:
>
>     {
>         "ok": true
>     }

Is there a way to stop/cancel a splitting job? `POST /_shard_splits/jobs/$jobid/state`? I guess DELETE could handle this but I could imagine wanting to see that a job was stopped rather than it just disappearing.

Mike.
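
P.S. To make the `transitions` idea a bit more concrete, here is a rough Python sketch of how a client might turn that list into time spent per state. It assumes the `transitions` shape from my example above (a list of `state` / `ts` pairs with UTC timestamps in that exact format), which is only a suggestion at this point; `parse_ts` and `time_in_states` are illustrative names, not part of any proposed API.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse a timestamp like 2019-01-23T18:36:17.951228Z as UTC."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def time_in_states(transitions):
    """Sum the seconds spent in each state.

    Each entry's ts is when that state was entered, so a state lasts
    until the next entry's ts. The final state has no exit time yet
    and is therefore left out of the totals.
    """
    totals = {}
    for current, following in zip(transitions, transitions[1:]):
        elapsed = parse_ts(following["ts"]) - parse_ts(current["ts"])
        state = current["state"]
        totals[state] = totals.get(state, 0.0) + elapsed.total_seconds()
    return totals


# The transitions from the example job above.
transitions = [
    {"state": "created", "ts": "2019-01-23T18:36:17.951228Z"},
    {"state": "running", "ts": "2019-01-23T18:36:18.457231Z"},
    {"state": "stopped", "ts": "2019-01-23T18:49:19.174453Z"},
    {"state": "running", "ts": "2019-01-23T19:49:19.174453Z"},
    {"state": "completed", "ts": "2019-01-23T19:55:19.174453Z"},
]
durations = time_in_states(transitions)
```

With a single created/started/updated set of fields you could not separate the hour the job sat stopped from the time it was actually running, which is the main reason I'd lean towards the list.
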