Hi,

I am designing a small distributed job scheduling system with a twist -- 
each job can be re-executed (jobs are idempotent), but the same job must 
never be executed by two workers in parallel. This requirement makes 
everything really difficult in the presence of network/worker failures.

In essence, at a high level it looks like this:
- Worker -- a process (one of many) that connects to the Coordinator, 
  receives jobs, executes them and submits generated sub-jobs back (if any)
  - think "traversing a filesystem": a "process this directory" job will 
    generate a bunch of sub-jobs (one for each directory item)
- Coordinator -- maintains the system state, feeds jobs to workers
- System State -- a list of jobs and their current status (executing on 
  worker X, done, etc.); can be just a list in memory or a table in a database
- a job can take a very long time
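
To make the state concrete, here is a rough Python sketch of what I have in 
mind. All names (Status, Job, SystemState, assign, complete) are made up for 
illustration, and it assumes a single in-memory Coordinator with no 
persistence and no failure handling yet:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    PENDING = "pending"   # waiting for a worker
    RUNNING = "running"   # executing on exactly one worker
    DONE = "done"


@dataclass
class Job:
    job_id: str
    status: Status = Status.PENDING
    worker: Optional[str] = None  # set only while RUNNING


@dataclass
class SystemState:
    jobs: Dict[str, Job] = field(default_factory=dict)

    def add(self, job_id: str) -> None:
        # Re-submitting an already known job is a no-op.
        self.jobs.setdefault(job_id, Job(job_id))

    def assign(self, worker: str) -> Optional[Job]:
        # Only PENDING jobs are handed out, so a RUNNING job is never
        # given to a second worker.
        for job in self.jobs.values():
            if job.status is Status.PENDING:
                job.status, job.worker = Status.RUNNING, worker
                return job
        return None

    def complete(self, job_id: str, worker: str, sub_jobs: List[str]) -> bool:
        job = self.jobs.get(job_id)
        # Reject results from anyone who is not the current owner.
        if job is None or job.status is not Status.RUNNING or job.worker != worker:
            return False
        job.status, job.worker = Status.DONE, None
        for sub in sub_jobs:
            self.add(sub)
        return True
```

For the filesystem example, the job id would be the directory path, and 
complete() would be called with the paths of the directory items as sub-jobs.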

I am having difficulty implementing the "no parallel execution" guarantee -- 
if a worker (or the connection to it) goes down, I need to recognize this in 
the Coordinator, "pause" all jobs that worker was running and (after some 
timeout or user action) resubmit them to another worker. The timeout (or 
user action) is required to give the worker (if it is still alive) time to 
detect the network error, stop its jobs and start the cycle again (try to 
register itself with the Coordinator, etc.). It is important that once a 
connection has been deemed broken, it is never reused (otherwise the worker 
may not notice the problem); the worker is treated as dead until it 
re-registers itself (after a job purge or restart).
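
The Coordinator side of this could look roughly like the sketch below. It is 
only an illustration under assumptions: the names (Session, Coordinator, 
connection_lost, tick, grace_seconds) are hypothetical, time is passed in 
explicitly, and it relies on the worker stopping its jobs within the grace 
period once it notices the broken connection. How the broken connection is 
detected is exactly the part I would like the transport to handle.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Session:
    worker: str
    jobs: Set[str] = field(default_factory=set)
    lost_at: Optional[float] = None  # when the connection was deemed broken


class Coordinator:
    def __init__(self, grace_seconds: float):
        self.grace = grace_seconds
        self.sessions: Dict[int, Session] = {}
        self.pending: List[str] = []
        self._ids = itertools.count(1)

    def register(self, worker: str) -> int:
        # Every registration gets a fresh id; old ids are never revived.
        sid = next(self._ids)
        self.sessions[sid] = Session(worker)
        return sid

    def assign(self, sid: int) -> Optional[str]:
        s = self.sessions.get(sid)
        if s is None or s.lost_at is not None or not self.pending:
            return None  # unknown or dead sessions get no work
        job = self.pending.pop(0)
        s.jobs.add(job)
        return job

    def connection_lost(self, sid: int, now: float) -> None:
        # Jobs stay attached to the session ("paused") during the grace period.
        s = self.sessions.get(sid)
        if s is not None and s.lost_at is None:
            s.lost_at = now

    def tick(self, now: float) -> None:
        # Requeue jobs of sessions that have been dead for the full grace period.
        for sid, s in list(self.sessions.items()):
            if s.lost_at is not None and now - s.lost_at >= self.grace:
                self.pending.extend(sorted(s.jobs))
                del self.sessions[sid]
```

A worker that reconnects calls register() again and gets a new session id, so 
nothing it says over the old session is ever trusted, and its old jobs only 
become available again after tick() sees the grace period expire.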

Can gRPC help me implement this? I feel like I am reinventing the wheel... 
This can certainly be done with raw TCP (with manual keep-alives), but I'd 
like to avoid coding all that logic.

Regards,
Michael.
