shekhars-li opened a new pull request, #1676: URL: https://github.com/apache/samza/pull/1676
[WIP - Do not merge yet]

Problem Statement:
- Yarn can sometimes create orphaned containers. In our production systems, we noticed overlapping Samza containers running and committing at the same time.
- If the stores are backed up to a blob store, the orphaned, overlapping container may delete a blob (deletes are common during delta state calculation in the commit lifecycle with the blob store backend). The other, non-orphaned container may still expect this blob to be present.
- This causes the container, and subsequently the job, to fail. The container fails with DeletedException, which is the blob store's response when a blob was present but has since been deleted.

Fix:
- During commit, if a container fails with DeletedException, let it fail and restart.
- During the recovery phase of the restart, fetch the deleted blob using a get() call with the getDeleted flag set. With this flag, the blob store returns the blob if it is tombstoned but not yet compacted.
- Recreate the blob by uploading it to the blob store afresh. Use the new blob id received to create a new checkpoint.
- After this, and for as long as the orphaned container has not been cleaned up by Yarn, the container should be able to commit regularly.

Tests: TBD
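
The recovery steps in the Fix section could be sketched roughly as follows. This is only an illustration, not the code in this PR: `BlobRecoverySketch`, `InMemoryBlobStore`, `put`, `delete` and `recoverBlob` are hypothetical names, and the in-memory store is an assumed stand-in for the real blob store client. Only `DeletedException` and the `get()` call with a `getDeleted` flag come from the description above, and their exact signatures are assumed here.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these types are Samza's actual classes.
class BlobRecoverySketch {

  /** Signals that a blob was present but has since been deleted (tombstoned). */
  static class DeletedException extends RuntimeException {
    DeletedException(String blobId) {
      super("Blob was deleted: " + blobId);
    }
  }

  /** In-memory stand-in for a blob store that tombstones blobs before compacting them. */
  static class InMemoryBlobStore {
    private final Map<String, byte[]> blobs = new HashMap<>();
    private final Set<String> tombstoned = new HashSet<>();
    private int nextId = 0;

    /** Uploads data and returns a freshly assigned blob id. */
    String put(byte[] data) {
      String id = "blob-" + nextId++;
      blobs.put(id, data.clone());
      return id;
    }

    /** Marks the blob as deleted; the bytes stay until compaction (not modeled here). */
    void delete(String id) {
      if (blobs.containsKey(id)) {
        tombstoned.add(id);
      }
    }

    /** Returns the blob; a tombstoned blob is returned only when getDeleted is true. */
    byte[] get(String id, boolean getDeleted) {
      byte[] data = blobs.get(id);
      if (data == null) {
        throw new IllegalArgumentException("Unknown blob: " + id);
      }
      if (tombstoned.contains(id) && !getDeleted) {
        throw new DeletedException(id);
      }
      return data.clone();
    }
  }

  /**
   * Returns a blob id that is safe to reference in a new checkpoint.
   * A live blob keeps its id; a tombstoned blob is fetched with
   * getDeleted = true and uploaded again under a new id.
   */
  static String recoverBlob(InMemoryBlobStore store, String blobId) {
    try {
      store.get(blobId, false);
      return blobId;
    } catch (DeletedException e) {
      byte[] data = store.get(blobId, true);
      return store.put(data);
    }
  }

  public static void main(String[] args) {
    InMemoryBlobStore store = new InMemoryBlobStore();
    String oldId = store.put("store snapshot".getBytes(StandardCharsets.UTF_8));
    store.delete(oldId); // simulates the orphaned container deleting the blob
    String newId = recoverBlob(store, oldId);
    System.out.println("checkpoint should now reference " + newId + " instead of " + oldId);
  }
}
```

In this sketch a blob that has already been compacted surfaces as an unknown-blob error rather than being recovered, which matches the "tombstoned but not compacted" condition above.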
