thinkharderdev commented on PR #59:
URL: https://github.com/apache/arrow-ballista/pull/59#issuecomment-1153735022

   > Thanks @thinkharderdev for the great work. The code is well implemented
   > and well documented. And the task scheduling algorithm is also very
   > sophisticated.
   > 
   > I only have a few small concerns:
   > 
   > 1. For the stage-based task scheduling, suppose a scenario of 3 stages,
   > stage3 ---> stage2 ---> stage1, with 1K tasks for stage1, 1 task for stage2,
   > and 1K tasks for stage3. When stage2 finishes, it's better to schedule the
   > tasks for stage3 all at once. However, with the algorithm based on the
   > _ExecutorReservation_, it seems this kind of all-at-once scheduling only
   > happens when the job is submitted. It may be better to keep the previous
   > _StageFinished_ event.
   > 2. For every state change for a job, like a task status update, this
   > _ExecutionGraph_-based implementation needs to fetch and decode the
   > _ExecutionGraph_ from the config backend. Will that be too heavy, especially
   > when there are thousands or millions of tasks for a job? The previous design
   > of keeping the tasks in memory aimed to reduce exactly this kind of cost.
   > Therefore, I prefer not to persist the task status info into the config backend.
   > 3. For the job scheduling policy, this implementation makes it possible for
   > one job to be scheduled by multiple schedulers. However, I don't think that's
   > necessary. It would be better to employ an active-standby policy and make the
   > recovery granularity stage level rather than task level if the job's active
   > scheduler terminates. Then we can avoid the _ExecutionGraph_ decoding cost
   > for every task update.
   
   Thanks @yahoNanJing for the review.
   
   1. This is a good point. I wanted to avoid locking the executor state as much 
as possible, but I see that the case you mention is a degenerate one (see the 
first sketch after this list).
   2. I am concerned about this as well. For simplicity I put everything into a 
single data structure, but conceptually there is no reason it has to be done 
that way. We can have the `ExecutionGraph` store only the DAG structure and keep 
the plan information in a separate data structure. Since the plan is immutable, 
it could be cached in memory much more effectively. And I think you're right 
that the entire task status doesn't need to be stored at all; we only need to 
know whether a task is pending or not (see the second sketch after this list).
   3. On this point I disagree. High availability is one goal of this work, but 
another is horizontal scalability. The work involved in physical planning is 
sometimes non-trivial. For instance, scanning a large Hive table can involve 
reading parquet metadata from many files, and it will be useful to have 
multiple active schedulers to scale out this work.
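
   On point 1, here is a minimal, self-contained sketch of what reacting to a 
stage-completion event could look like, so that every task of a newly runnable 
stage is offered in a single pass rather than only at job submission. The types 
(`Stage`, `JobState`, `on_stage_finished`) are simplified stand-ins for 
illustration, not the actual `ExecutionGraph`/`ExecutorReservation` API:

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for a stage in the graph.
struct Stage {
    /// Ids of the stages this stage reads its input from.
    inputs: HashSet<usize>,
    /// Number of tasks (partitions) in this stage.
    task_count: usize,
}

/// Simplified per-job graph state.
struct JobState {
    stages: HashMap<usize, Stage>,
    completed: HashSet<usize>,
}

impl JobState {
    /// Mark `stage_id` complete and return (stage_id, task_count) for every
    /// stage that just became runnable, so all of its tasks can be offered to
    /// executors at once instead of waiting for new reservations.
    fn on_stage_finished(&mut self, stage_id: usize) -> Vec<(usize, usize)> {
        self.completed.insert(stage_id);
        let completed = &self.completed;
        self.stages
            .iter()
            .filter(|&(id, stage)| {
                !completed.contains(id)
                    && stage.inputs.contains(&stage_id)
                    && stage.inputs.is_subset(completed)
            })
            .map(|(id, stage)| (*id, stage.task_count))
            .collect()
    }
}

fn main() {
    // stage3 ---> stage2 ---> stage1, as in the example above.
    let mut job = JobState {
        stages: HashMap::from([
            (1, Stage { inputs: HashSet::new(), task_count: 1000 }),
            (2, Stage { inputs: HashSet::from([1]), task_count: 1 }),
            (3, Stage { inputs: HashSet::from([2]), task_count: 1000 }),
        ]),
        completed: HashSet::from([1]),
    };

    // When stage2 finishes, all 1000 tasks of stage3 become schedulable at once.
    assert_eq!(job.on_stage_finished(2), vec![(3, 1000)]);
}
```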
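
   On point 2, here is a rough sketch of the split described above, assuming the 
immutable plan can be cached in memory per scheduler while only a compact 
pending/not-pending flag per task is persisted in the config backend. Again, 
`StagePlan`, `PlanCache` and `JobTaskState` are hypothetical names for 
illustration, not the actual implementation:

```rust
use std::collections::HashMap;

/// Stand-in for the immutable physical plan of a stage. In the real code this
/// would be the (serialized) stage plan; here it is just opaque bytes.
struct StagePlan {
    stage_id: usize,
    plan_bytes: Vec<u8>,
}

/// Immutable plans only need to be fetched and decoded once, so they can live
/// in an in-memory cache keyed by job id on each scheduler instance.
type PlanCache = HashMap<String, Vec<StagePlan>>;

/// Compact, mutable per-job state that would be persisted in the config
/// backend: one "pending" flag per task instead of the full task status.
#[derive(Default)]
struct JobTaskState {
    /// pending[stage_id][task_id] is true while the task has not completed.
    pending: HashMap<usize, Vec<bool>>,
}

impl JobTaskState {
    fn mark_completed(&mut self, stage_id: usize, task_id: usize) {
        if let Some(tasks) = self.pending.get_mut(&stage_id) {
            if let Some(flag) = tasks.get_mut(task_id) {
                *flag = false;
            }
        }
    }

    fn stage_complete(&self, stage_id: usize) -> bool {
        self.pending
            .get(&stage_id)
            .map(|tasks| tasks.iter().all(|&pending| !pending))
            .unwrap_or(false)
    }
}

fn main() {
    // The plan is decoded once (e.g. at job submission) and cached thereafter.
    let mut plans: PlanCache = HashMap::new();
    plans.insert(
        "job-1".to_string(),
        vec![StagePlan { stage_id: 2, plan_bytes: vec![] }],
    );

    // A task status update only needs to touch this small structure instead of
    // re-decoding the whole graph.
    let mut state = JobTaskState::default();
    state.pending.insert(2, vec![true]); // stage2 has a single task
    state.mark_completed(2, 0);
    assert!(state.stage_complete(2));
    assert!(plans["job-1"][0].plan_bytes.is_empty());
    assert_eq!(plans["job-1"][0].stage_id, 2);
}
```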

