Rayman created SAMZA-1992:
-----------------------------

             Summary: Hot standby containers for improving recovery-time of 
Stateful jobs
                 Key: SAMZA-1992
                 URL: https://issues.apache.org/jira/browse/SAMZA-1992
             Project: Samza
          Issue Type: Improvement
            Reporter: Rayman


Recovery time for jobs with large state increases significantly in host-failure 
scenarios. This is problematic because of two reasons: 
a) causes _unavailability_ , 

and  
b) causes _backlog_ _buildup_ in case of jobs with high input QPS, requiring 
scaleup (then scale-down) or causes increased catch-up time.

 

The solution comprises of two parts 

*1. Increasing restore parallelism*: Samza restores stores at SamzaContainer 
startup sequentially for each task (using TaskStorageManager and 
TaskSideInputStorage Manager). Parallelizing task restores. We can parallelize 
store restores either in SamzaContainer (using a bounded/configurable 
threadpool) or in the TaskInstance or in the TaskStorageManager.

*2. Hot-Standby Containers:* 
A  Samza container which consumes input, reads or updates state, or invokes 
external services or produces outputs is called an *“active container.”*  
A *hot-standby container* is one which carries updated or hot KV state, and 
guarantees that, when it is used for a failover, its KV state corresponds to 
atleast the last checkpoint of the corresponding active container (for each 
task). There are multiple ways of building such hot-standby container 
implementations, see this section for pros and cons. Restore time With 
hot-standby container, restore times should be similar to host-affinity 
restore-times -- 10s of seconds _regardless of state size_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to