[ 
https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441521#comment-16441521
 ] 

Andrew Purtell commented on HBASE-20431:
----------------------------------------

Description subject to change as this idea is thought through.


> Store commit transaction for filesystems that do not support an atomic rename
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-20431
>                 URL: https://issues.apache.org/jira/browse/HBASE-20431
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Purtell
>            Priority: Major
>
> HBase expects the Hadoop filesystem implementation to support an atomic 
> rename() operation. HDFS does. The S3 backed filesystems do not. The 
> fundamental issue is the non-atomic and eventually consistent nature of the 
> S3 service. A S3 bucket is not a filesystem. S3 is not always immediately 
> read-your-writes. Object metadata can be temporarily inconsistent just after 
> new objects are stored. There can be a settling period to ride over. 
> Renaming/moving objects from one path to another are copy operations with 
> O(file) complexity and O(data) time followed by a series of deletes with 
> O(file) complexity. Failures at any point prior to completion will leave the 
> operation in an inconsistent state. The missing atomic rename semantic opens 
> opportunities for corruption and data loss, which may or may not be 
> repairable with HBCK.
> Handling this at the HBase level could be done with a new multi-step 
> filesystem transaction framework. Call it StoreCommitTransaction. 
> SplitTransaction and MergeTransaction are well established cases where even 
> on HDFS we have non-atomic filesystem changes and are our implementation 
> template for the new work. In this new StoreCommitTransaction we'd be moving 
> flush and compaction temporaries out of the temporary directory into the 
> region store directory. On HDFS the implementation would be easy. We can rely 
> on the filesystem's atomic rename semantics. On S3 it would be work: First we 
> would build the list of objects to move, then copy each object into the 
> destination, and then finally delete all objects at the original path. We 
> must handle transient errors with retry strategies appropriate for the action 
> at hand. We must handle serious or permanent errors where the RS doesn't need 
> to be aborted with a rollback that cleans it all up. Finally, we must handle 
> permanent errors where the RS must be aborted with a rollback during region 
> open/recovery. Note that after all objects have been copied and we are 
> deleting obsolete source objects we must roll forward, not back. To support 
> recovery after an abort we must utilize the WAL to track transaction 
> progress. Put markers in for StoreCommitTransaction start and completion 
> state, with details of the store file(s) involved, so it can be rolled back 
> during region recovery at open. This will be significant work in HFile, 
> HStore, flusher, compactor, and HRegion. Wherever we use HDFS's rename now we 
> would substitute the running of this new multi-step filesystem transaction.
> We need to determine this for certain, but I believe on S3 the PUT or 
> multipart upload of an object must complete before the object is visible, so 
> we don't have to worry about the case where an object is visible before fully 
> uploaded as part of normal operations. So an individual object copy will 
> either happen entirely and the target will then become visible, or it won't 
> and the target won't exist.
> S3 has an optimization, PUT COPY 
> (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html), which 
> the AmazonClient embedded in S3A utilizes for moves. When designing the 
> StoreCommitTransaction be sure to allow for filesystem implementations that 
> leverage a server side copy operation. Doing a get-then-put should be 
> optional. (Not sure Hadoop has an interface that advertises this capability 
> yet; we can add one if not.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to