[ 
https://issues.apache.org/jira/browse/HBASE-28042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28042.
----------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed

> Snapshot corruptions due to non-atomic rename within same filesystem
> --------------------------------------------------------------------
>
>                 Key: HBASE-28042
>                 URL: https://issues.apache.org/jira/browse/HBASE-28042
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Sequence of events that can lead to snapshot corruptions:
>  # Create snapshot using admin command
>  # Active master triggers async snapshot creation
>  # If the snapshot operation doesn't complete within 5 min, client gets 
> exception
> {code:java}
> org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 
> 'T1_1691888405683_1691888440827_1' wasn't completed in expectedTime:600000 ms 
>   {code}
>  # Client initiates snapshot deletion after this error
>  # In the snapshot completion/commit phase, the files are moved from tmp to 
> final dir.
>  # The snapshot delete and snapshot commit operations can then interleave 
> and corrupt the snapshot by leaving incomplete metadata:
>  * [Snapshot commit] create 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
>  * [Snapshot delete from client]  delete 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
>  * [Snapshot commit]  create 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/data-manifest"
>  
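The interleaving above is only possible because the commit publishes files one at a time. As a minimal sketch, assuming the local filesystem and java.nio.file as a stand-in for HDFS (the class name, method names and the "T1" directory are hypothetical, not HBase code), a commit done as a single directory rename makes the snapshot directory appear with both metadata files or not at all:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; this is not HBase code.
class AtomicCommitSketch {
  /** Publishes a fully written working directory with one atomic rename. */
  static void commit(Path workingDir, Path snapshotDir) throws IOException {
    Files.move(workingDir, snapshotDir, StandardCopyOption.ATOMIC_MOVE);
  }

  /** Treats a snapshot directory as usable only if both metadata files exist. */
  static boolean isComplete(Path snapshotDir) {
    return Files.exists(snapshotDir.resolve(".snapshotinfo"))
      && Files.exists(snapshotDir.resolve("data-manifest"));
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("hbase-sketch");
    // Write all metadata into the working directory first.
    Path workingDir = Files.createDirectory(root.resolve(".tmp"));
    Files.createFile(workingDir.resolve(".snapshotinfo"));
    Files.createFile(workingDir.resolve("data-manifest"));

    Path snapshotDir = root.resolve("T1");
    System.out.println(isComplete(snapshotDir)); // nothing published yet
    commit(workingDir, snapshotDir);
    System.out.println(isComplete(snapshotDir)); // both files visible together
  }
}
```

With a file-by-file copy, a concurrent delete can run between the two file creations, which is the window described in the bullets above.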
> The changes introduced by HBASE-21098 perform an atomic rename for HBase 1 
> but not for HBase 2:
> {code:java}
>   public static void completeSnapshot(Path snapshotDir, Path workingDir,
>     FileSystem fs, FileSystem workingDirFs, final Configuration conf)
>     throws SnapshotCreationException, IOException {
>     LOG.debug("Sentinel is done, just moving the snapshot from " + workingDir
>       + " to " + snapshotDir);
>     URI workingURI = workingDirFs.getUri();
>     URI rootURI = fs.getUri();
>     if (
>       (!workingURI.getScheme().equals(rootURI.getScheme())
>         || workingURI.getAuthority() == null
>         || !workingURI.getAuthority().equals(rootURI.getAuthority())
>         // always true for hdfs://{cluster}
>         || workingURI.getUserInfo() == null
>         || !workingURI.getUserInfo().equals(rootURI.getUserInfo())
>         // this condition isn't even evaluated due to short circuit above
>         || !fs.rename(workingDir, snapshotDir))
>         // non-atomic rename operation
>         && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true,
>           true, conf)
>     ) {
>       throw new SnapshotCreationException("Failed to copy working directory("
>         + workingDir + ") to completed directory(" + snapshotDir + ").");
>     }
>   } {code}
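To see why the rename is never attempted, the URI part of the condition above can be evaluated on its own. This is a minimal sketch using only java.net.URI; the class and method names and the "mycluster" authority are hypothetical placeholders, and the sketch deliberately leaves out the Hadoop FileSystem calls:

```java
import java.net.URI;

// Hypothetical illustration only; mirrors the URI checks quoted above.
class Hbase2ConditionSketch {
  /** Returns true when the short circuit fires, i.e. fs.rename() is never called. */
  static boolean renameIsSkipped(URI workingURI, URI rootURI) {
    return !workingURI.getScheme().equals(rootURI.getScheme())
      || workingURI.getAuthority() == null
      || !workingURI.getAuthority().equals(rootURI.getAuthority())
      || workingURI.getUserInfo() == null
      || !workingURI.getUserInfo().equals(rootURI.getUserInfo());
  }

  public static void main(String[] args) {
    // Same filesystem on both sides, with no user-info component in the URI.
    URI root = URI.create("hdfs://mycluster");
    URI working = URI.create("hdfs://mycluster");
    System.out.println(working.getUserInfo());           // prints "null"
    System.out.println(renameIsSkipped(working, root));  // prints "true"
  }
}
```

Because a URI of this form carries no user info, the null check alone makes the whole left-hand side true, and the code falls through to the copy.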
> whereas for HBase 1:
> {code:java}
> // check UGI/userInfo
> if (workingURI.getUserInfo() == null && rootURI.getUserInfo() != null) {
>   return true;
> }
> if (workingURI.getUserInfo() != null &&
>     !workingURI.getUserInfo().equals(rootURI.getUserInfo())) {
>   return true;
> }
>  {code}
> This causes shouldSkipRenameSnapshotDirectories() to return false if 
> workingURI and rootURI share the same filesystem, which always leads to an 
> atomic rename:
> {code:java}
> if ((shouldSkipRenameSnapshotDirectories(workingURI, rootURI)
>     || !fs.rename(workingDir, snapshotDir))
>     && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, true,
>       conf)) {
>   throw new SnapshotCreationException("Failed to copy working directory("
>       + workingDir + ") to completed directory(" + snapshotDir + ").");
> } {code}
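The null-tolerant comparison quoted for HBase 1 can be sketched in isolation as follows. Only the user-info check appears in the description; applying the same pattern to scheme and authority is an assumption of this sketch, and the class name, method names and "mycluster"/"othercluster" authorities are hypothetical:

```java
import java.net.URI;

// Hypothetical illustration only; a reconstruction, not the HBase 1 source.
class Hbase1SkipCheckSketch {
  /** Same pattern as the quoted user-info check: a null working value never differs from a null root value. */
  static boolean differs(String working, String root) {
    if (working == null && root != null) {
      return true;
    }
    return working != null && !working.equals(root);
  }

  /** Returns true only when the two URIs point at different filesystems. */
  static boolean shouldSkipRename(URI workingURI, URI rootURI) {
    return differs(workingURI.getScheme(), rootURI.getScheme())
      || differs(workingURI.getAuthority(), rootURI.getAuthority())
      || differs(workingURI.getUserInfo(), rootURI.getUserInfo());
  }

  public static void main(String[] args) {
    URI root = URI.create("hdfs://mycluster");
    // Same filesystem: rename is attempted.
    System.out.println(shouldSkipRename(URI.create("hdfs://mycluster"), root));    // prints "false"
    // Different authority: rename is skipped and the copy path is used.
    System.out.println(shouldSkipRename(URI.create("hdfs://othercluster"), root)); // prints "true"
  }
}
```

The difference from the HBase 2 condition is that a null user info on both sides counts as a match rather than a mismatch.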
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
