Kai Huang created AURORA-1929:
---------------------------------

             Summary: Improve explicit task history pruning.
                 Key: AURORA-1929
                 URL: https://issues.apache.org/jira/browse/AURORA-1929
             Project: Aurora
          Issue Type: Task
            Reporter: Kai Huang
            Assignee: Kai Huang
            Priority: Minor


There are currently two types of task history pruning run by Aurora:
# Implicit task history pruning, run by TaskHistoryPruner in the 
background, which registers all inactive tasks for pruning upon terminal state 
change.
# Explicit task history pruning, initiated by the `aurora_admin prune_tasks` 
command, which prunes inactive tasks in the cluster.

The prune_tasks endpoint seems to be very slow when the cluster has a large 
number of inactive tasks. For example, when we run `aurora_admin prune_tasks` 
for 135k running tasks (1k jobs), it takes about 30 minutes to prune all 
tasks; the pruning speed seems to max out at 3k tasks per minute.

Currently, Aurora uses StreamManager to manage a single log stream append 
transaction for task history pruning. Local storage ops (RemoveTasks) can be 
added to the transaction and then later committed as an atomic unit.

However, the current implementation removes tasks one by one in a for-loop 
(https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
and coalesces each RemoveTasks operation with its previous operation, which 
seems inefficient and unnecessary 
(https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).

We need to batch all RemoveTasks operations and execute them all at once to 
avoid the additional cost of coalescing. The fix will also benefit implicit task 
history pruning, since it has a similar underlying implementation.
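
To illustrate the idea, here is a minimal sketch using hypothetical stand-in 
types, not Aurora's actual StateManagerImpl or StreamManagerImpl code. The 
Txn class, its removeTasks method, and the coalesceAttempts counter are all 
assumptions made for illustration; the coalescing is only loosely modeled as 
"merge into the previous operation". It contrasts one remove operation per 
task with a single operation carrying all task IDs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class BatchPruneSketch {
  /** Hypothetical stand-in for a log transaction; not Aurora's real API. */
  public static final class Txn {
    public final List<Set<String>> ops = new ArrayList<>();
    public int coalesceAttempts = 0;

    /** Records a remove operation, merging it into the previous one if any. */
    public void removeTasks(Set<String> taskIds) {
      if (ops.isEmpty()) {
        ops.add(new LinkedHashSet<>(taskIds));
      } else {
        // Every call after the first pays for a coalescing step.
        coalesceAttempts++;
        ops.get(ops.size() - 1).addAll(taskIds);
      }
    }
  }

  /** Mirrors the per-task loop described above: one operation per task. */
  public static Txn pruneOneByOne(List<String> taskIds) {
    Txn txn = new Txn();
    for (String id : taskIds) {
      txn.removeTasks(Collections.singleton(id));
    }
    return txn;
  }

  /** Proposed shape: collect every ID first, then issue one operation. */
  public static Txn pruneBatched(List<String> taskIds) {
    Txn txn = new Txn();
    txn.removeTasks(new LinkedHashSet<>(taskIds));
    return txn;
  }

  public static void main(String[] args) {
    List<String> ids = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      ids.add("task-" + i);
    }
    System.out.println("one-by-one coalesce attempts: "
        + pruneOneByOne(ids).coalesceAttempts);
    System.out.println("batched coalesce attempts: "
        + pruneBatched(ids).coalesceAttempts);
  }
}
```

In this sketch both approaches end with a single operation holding the same 
task IDs, but the per-task loop performs one coalescing step for every task 
after the first, while the batched call performs none.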



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
