shekhars-li opened a new pull request, #1676:
URL: https://github.com/apache/samza/pull/1676

   [WIP - Do not merge yet]
   Problem Statement:
   - Yarn can sometimes create orphaned containers. In our production systems,
we observed overlapping Samza containers running and committing at the same
time.
   - If the stores are backed up to a blob store, the orphaned, overlapping
container may delete a blob (deletions are common during delta state
calculation in the commit lifecycle with the blob store backend) that the
other, non-orphaned container still expects to be present.
   - This causes the container, and subsequently the job, to fail. The
container fails with DeletedException, which is the blob store's response
indicating that the blob was once present but has since been deleted.
   
   Fix:
   - During commit, if a container fails with DeletedException, let it 
fail/restart.
   - During the recovery phase of the restart, fetch the deleted blob with a
get() call that sets the getDeleted flag. With this flag, the blob store still
returns a blob that is tombstoned but not yet compacted.
   - Recreate the blob by uploading it to the blob store afresh, and use the
new blob id returned to create a new checkpoint.
   - After this, and for as long as the orphaned container has not been
cleaned up by Yarn, the container should be able to commit regularly.
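
The recovery steps above can be sketched roughly as follows. This is a minimal
illustration under stated assumptions, not the code in this PR: the
InMemoryBlobStore class, its put/delete/get signatures, the local
DeletedException class, and the recoverBlob helper are all hypothetical
stand-ins written for this example and are not Samza's or any blob store's
real API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these types are Samza's real API. */
class DeletedBlobRecoverySketch {

  /** Hypothetical stand-in for the "blob was present but is gone" error. */
  static class DeletedException extends RuntimeException {
    DeletedException(String blobId) {
      super("Blob deleted: " + blobId);
    }
  }

  /** Hypothetical in-memory blob store that keeps tombstones until compaction. */
  static class InMemoryBlobStore {
    private final Map<String, byte[]> live = new HashMap<>();
    private final Map<String, byte[]> tombstoned = new HashMap<>();
    private int nextId = 0;

    /** Uploads data and returns a freshly assigned blob id. */
    String put(byte[] data) {
      String id = "blob-" + (nextId++);
      live.put(id, data.clone());
      return id;
    }

    /** Tombstones a live blob; the data stays until a (not modelled) compaction. */
    void delete(String id) {
      byte[] data = live.remove(id);
      if (data != null) {
        tombstoned.put(id, data);
      }
    }

    /** Returns a live blob, or a tombstoned one only when getDeleted is true. */
    byte[] get(String id, boolean getDeleted) {
      byte[] data = live.get(id);
      if (data != null) {
        return data.clone();
      }
      byte[] dead = tombstoned.get(id);
      if (dead == null) {
        throw new IllegalArgumentException("Unknown blob: " + id);
      }
      if (!getDeleted) {
        throw new DeletedException(id);
      }
      return dead.clone();
    }
  }

  /**
   * Recovery step: returns the blob id the new checkpoint should reference.
   * If the checkpointed blob is still live, its id is returned unchanged;
   * if it was tombstoned, its contents are re-uploaded under a new id.
   */
  static String recoverBlob(InMemoryBlobStore store, String checkpointedBlobId) {
    try {
      store.get(checkpointedBlobId, false);
      return checkpointedBlobId;
    } catch (DeletedException e) {
      byte[] data = store.get(checkpointedBlobId, true); // read the tombstoned blob
      return store.put(data); // upload afresh and hand back the new id
    }
  }

  public static void main(String[] args) {
    InMemoryBlobStore store = new InMemoryBlobStore();
    String oldId = store.put("sst-file-contents".getBytes());
    store.delete(oldId); // simulate the orphaned container's delete
    String newId = recoverBlob(store, oldId);
    System.out.println("checkpoint should reference " + newId + " instead of " + oldId);
  }
}
```

In this sketch a blob that has already been compacted cannot be recovered; how
the actual change handles that case is not described here.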
   
   Tests:
   TBD


