Shekhar Sharma created SAMZA-2787:
-------------------------------------

             Summary: Add GetDeleted API to Blob Store backup and restore 
managers and recover from DeletedException
                 Key: SAMZA-2787
                 URL: https://issues.apache.org/jira/browse/SAMZA-2787
             Project: Samza
          Issue Type: Improvement
            Reporter: Shekhar Sharma


Problem Statement:
 * YARN can sometimes leave orphaned containers running. In our production 
systems, we observed overlapping Samza containers running and committing at the 
same time.
 * If the stores are backed up to a blob store, the orphaned, overlapping 
container may delete a blob (deletions are common during delta state 
calculation in the commit lifecycle with the blob store backend). The other, 
non-orphaned container may still expect this blob to be present.
 * This causes the container, and subsequently the job, to fail. The container 
fails with DeletedException, which is the blob store's response indicating that 
the blob once existed but has since been deleted.

Fix:
 * During commit, if a container fails with DeletedException, let the container 
fail/restart.
 * During the recovery phase of the restart, fetch the deleted blob through a 
get() call with the getDeleted flag set. The flag tells the blob store to 
return a blob that is marked for deletion but not yet compacted.
 * Recreate the blob by uploading it to the blob store afresh. Use the new blob 
id returned by the upload to create a new checkpoint.
 * Write this new checkpoint to the checkpoint topic.
 * After this, and as long as the orphaned container is not cleaned up by YARN, 
the container should be able to commit regularly.
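
The recovery steps above can be sketched roughly as follows. This is a minimal, 
self-contained illustration and not Samza's actual blob store manager API: 
InMemoryBlobStore, recoverBlob, and the exact shape of get(id, getDeleted) are 
hypothetical names chosen for the example, and the DeletedException here is a 
local stand-in. Writing the new checkpoint to the checkpoint topic is left out; 
the sketch only returns the blob id that such a checkpoint would reference.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types are Samza's real classes.
class BlobRecoverySketch {

  /** Local stand-in for the blob store's "blob existed but is gone" error. */
  static class DeletedException extends RuntimeException {
    DeletedException(String blobId) {
      super("Blob was deleted: " + blobId);
    }
  }

  /** Toy blob store: delete() only marks a blob; compaction is not modeled. */
  static class InMemoryBlobStore {
    private final Map<String, byte[]> blobs = new HashMap<>();
    private final Map<String, Boolean> markedDeleted = new HashMap<>();
    private int nextId = 0;

    /** Uploads a blob and returns its newly assigned id. */
    String put(byte[] data) {
      String id = "blob-" + nextId++;
      blobs.put(id, data.clone());
      markedDeleted.put(id, false);
      return id;
    }

    /** Marks a blob for deletion; the bytes stay until (unmodeled) compaction. */
    void delete(String id) {
      if (blobs.containsKey(id)) {
        markedDeleted.put(id, true);
      }
    }

    /** Returns the blob; a deleted blob is returned only if getDeleted is true. */
    byte[] get(String id, boolean getDeleted) {
      byte[] data = blobs.get(id);
      if (data == null) {
        throw new IllegalArgumentException("Unknown blob: " + id);
      }
      if (markedDeleted.get(id) && !getDeleted) {
        throw new DeletedException(id);
      }
      return data.clone();
    }
  }

  /** Returns a blob id that is safe to reference from a new checkpoint. */
  static String recoverBlob(InMemoryBlobStore store, String checkpointedId) {
    try {
      store.get(checkpointedId, false);
      return checkpointedId; // blob is intact, keep the existing id
    } catch (DeletedException e) {
      // Fetch the deleted-but-not-compacted blob and upload it afresh.
      byte[] data = store.get(checkpointedId, true);
      return store.put(data);
    }
  }

  public static void main(String[] args) {
    InMemoryBlobStore store = new InMemoryBlobStore();
    String oldId = store.put("store-state".getBytes(StandardCharsets.UTF_8));
    store.delete(oldId); // simulates the orphaned container's delete
    String newId = recoverBlob(store, oldId);
    System.out.println("checkpoint should now reference " + newId);
  }
}
```

In this sketch the old blob id is never reused: the restarted container ends up 
with a fresh id that the orphaned container does not know about, which is why a 
new checkpoint has to be written afterwards.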



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
