> In this way, worker-server is not using lock or connecting db anymore. And it 
> can execute task concurrently.
It's good.I agree with it.

In a real operating environment, often appeard the network is bad, or there is 
an exception in the master and worker.
I'm more interested in fault tolerance,I also hope to discuss more about fault 
tolerance.

For example, talk about fault tolerance of the worker
1、The master task has been successfully pushed and task is running in 
worker,but worker is down.
If this happened,How this task was taken over by other workers?
If the master cached the workerlist in memory,then reday to resend the task to 
other worker,this master down at this time.
Can  other master get the workerlist of this master and only one master resend 
the task to one of other worker?

2、If after the task execution is completed, the worker down without reporting 
the information to the master.
If this happened,How to update the status of this task?Now if this 
happened,Master will choose other worker to execute this task.
But I think if this is a long connetion task,the above method also need to 
improve.

________________________________
DolphinScheduler(Incubator) PPMC
Gang Li 李岗

[email protected]<mailto:[email protected]>

From: guo jiwei<mailto:[email protected]>
Date: 2019-12-09 19:19
To: dev<mailto:[email protected]>
Subject: A proposal for DolphinScheduler- refactor WorkerServer/MasterServer

Hello everyone ,

    I would like to share some ideas about refactoring 
WorkerServer/MasterServer for dolphin-scheduler.

Background
   For current implement of dolphin-scheduler, task info are stored in 
zookeeper , and worker-server is using zookeeper lock to keep executing task 
continuously. Each worker will try to acquire lock , and if it gets the lock, 
it has the ability to execute the task (fetch task from zk, get task info from 
db, and etc), or it has to wait for the lock . This is not a nice way,  for 
performance, dependance or parallelism.

Proposal
    I suggest worker-server execute task in a way like listening tcp port, and 
receive task command via rpc request instead.  In this way, worker-server is 
not using lock or connecting db anymore. And it can execute task concurrently.

General Implementation
   1.  Refactor worker-server as a tcp server listening some port using Netty 
for tcp communication. And we define our own binary protocol for scheduling.
   2.  After starting worker-server, register itself in zookeeper , ephemeral 
node like /dolphinscheduler/nodes/worker/test/xxx.xxx.xxx.xxx:9800.  
{/dolphinscheduler/nodes/worker/$workerGroup/ip:port}
   3.  MasterServer has take the responsibility for trigger the task, and 
choose worker-server to execute it 。
     - first, we have to listen for worker nodes in zookeeper. And cache the 
worker list in memory .
    -  second,  when  MasterScheduleThread acquire the lock to execute 
command(t_ds_command) ,  it will extract the task info process instance, then 
establish a tcp connection to target available worker using task info , and 
send the command to worker.
   4. When worker-server receives a task command , it will deserialize the 
command into a Task, and execute in thread pool or subprocess.
   5. After worker-server executes the task , it has to report the result using 
the pre-connected socket to MasterServer。but if the socket is closed in any 
way, WorkerServer has to connect to any other MasterServer to report the result.


More detail for MasterServer/WorkerServer can be discuss in later mail

Simple graph
   [X]
[image.png]




<https://maas.mail.163.com/dashi-web-extend/html/proSignature.html?ftlId=1&name=Tboy&uid=technotboy%40gmail.com&iconUrl=https%3A%2F%2Fmail-online.nosdn.127.net%2Fqiyelogo%2FdefaultAvatar.png&items=%5B%22technoboy%40apache.org%22%5D>
[https://mail-online.nosdn.127.net/qiyelogo/defaultAvatar.png]
Tboy
[email protected]

Reply via email to