[ https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441521#comment-16441521 ]
Andrew Purtell commented on HBASE-20431: ---------------------------------------- Description subject to change as this idea is thought through. > Store commit transaction for filesystems that do not support an atomic rename > ----------------------------------------------------------------------------- > > Key: HBASE-20431 > URL: https://issues.apache.org/jira/browse/HBASE-20431 > Project: HBase > Issue Type: Sub-task > Reporter: Andrew Purtell > Priority: Major > > HBase expects the Hadoop filesystem implementation to support an atomic > rename() operation. HDFS does. The S3 backed filesystems do not. The > fundamental issue is the non-atomic and eventually consistent nature of the > S3 service. A S3 bucket is not a filesystem. S3 is not always immediately > read-your-writes. Object metadata can be temporarily inconsistent just after > new objects are stored. There can be a settling period to ride over. > Renaming/moving objects from one path to another are copy operations with > O(file) complexity and O(data) time followed by a series of deletes with > O(file) complexity. Failures at any point prior to completion will leave the > operation in an inconsistent state. The missing atomic rename semantic opens > opportunities for corruption and data loss, which may or may not be > repairable with HBCK. > Handling this at the HBase level could be done with a new multi-step > filesystem transaction framework. Call it StoreCommitTransaction. > SplitTransaction and MergeTransaction are well established cases where even > on HDFS we have non-atomic filesystem changes and are our implementation > template for the new work. In this new StoreCommitTransaction we'd be moving > flush and compaction temporaries out of the temporary directory into the > region store directory. On HDFS the implementation would be easy. We can rely > on the filesystem's atomic rename semantics. On S3 it would be work: First we > would build the list of objects to move, then copy each object into the > destination, and then finally delete all objects at the original path. We > must handle transient errors with retry strategies appropriate for the action > at hand. We must handle serious or permanent errors where the RS doesn't need > to be aborted with a rollback that cleans it all up. Finally, we must handle > permanent errors where the RS must be aborted with a rollback during region > open/recovery. Note that after all objects have been copied and we are > deleting obsolete source objects we must roll forward, not back. To support > recovery after an abort we must utilize the WAL to track transaction > progress. Put markers in for StoreCommitTransaction start and completion > state, with details of the store file(s) involved, so it can be rolled back > during region recovery at open. This will be significant work in HFile, > HStore, flusher, compactor, and HRegion. Wherever we use HDFS's rename now we > would substitute the running of this new multi-step filesystem transaction. > We need to determine this for certain, but I believe on S3 the PUT or > multipart upload of an object must complete before the object is visible, so > we don't have to worry about the case where an object is visible before fully > uploaded as part of normal operations. So an individual object copy will > either happen entirely and the target will then become visible, or it won't > and the target won't exist. > S3 has an optimization, PUT COPY > (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html), which > the AmazonClient embedded in S3A utilizes for moves. When designing the > StoreCommitTransaction be sure to allow for filesystem implementations that > leverage a server side copy operation. Doing a get-then-put should be > optional. (Not sure Hadoop has an interface that advertises this capability > yet; we can add one if not.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)