[ https://issues.apache.org/jira/browse/HDDS-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190638#comment-17190638 ]

Glen Geng commented on HDDS-4107:
---------------------------------

While working on the upgrade issues for this PR, I found that renaming the 
volume dir is not enough.

The scmId lives not only in the volume dir but also in the container metadata, 
which would make the upgrade procedure for a huge cluster impossible: every 
container's metadata would have to be rewritten as well (see the sketch after 
the grep output below).

 

Here is an example from the upgrade case of the acceptance tests.

The SCM VERSION file is as follows:

 
{code:java}
[hadoop@9 ~/glengeng/hadoop-ozone/hadoop-ozone/dist/target/ozone-0.6.0-SNAPSHOT/compose/upgrade/data]$ cat scm/metadata/scm/current/VERSION
#Fri Sep 04 07:08:47 UTC 2020
cTime=1599203327270
clusterID=CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5
nodeType=SCM
scmUuid=9176f875-c8f2-4dcd-8d1d-5b988ad25914
{code}
 

 

The layout after renaming the volume dir from scmId to clusterId:

 
{code:java}
[hadoop@9 ~/glengeng/hadoop-ozone/hadoop-ozone/dist/target/ozone-0.6.0-SNAPSHOT/compose/upgrade/data/dn2/hdds]$ tree .
.
|-- hdds
|   |-- CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5
|   |   `-- current
|   |       `-- containerDir0
|   |           `-- 1
|   |               |-- chunks
|   |               |   `-- 104805390775943168_chunk_1
|   |               `-- metadata
|   |                   |-- 1-dn-container.db
|   |                   |   |-- 000006.log
|   |                   |   |-- CURRENT
|   |                   |   |-- IDENTITY
|   |                   |   |-- LOCK
|   |                   |   |-- LOG
|   |                   |   |-- LOG.old.1599203351819700
|   |                   |   |-- MANIFEST-000005
|   |                   |   |-- OPTIONS-000005
|   |                   |   `-- OPTIONS-000008
|   |                   `-- 1.container
|   `-- VERSION
`-- scmUsed

8 directories, 13 files
{code}
 

Grep for the scmId, and it turns up in the container metadata:

 
{code:java}
[hadoop@9 ~/glengeng/hadoop-ozone/hadoop-ozone/dist/target/ozone-0.6.0-SNAPSHOT/compose/upgrade/data/dn2]$ find . -type f | xargs grep 9176f875
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/OPTIONS-000008:  wal_dir=/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.819945 7fe6f187b700 SST files in /data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db dir, Total Num: 0, files:
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.819947 7fe6f187b700 Write Ahead Log file in /data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db: 000003.log size: 0 ;
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.819972 7fe6f187b700 Options.wal_dir: /data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.820954 7fe6f187b700 [/version_set.cc:3731] Recovered from manifest file:/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db/MANIFEST-000001 succeeded,manifest_file_number is 1, next_file_number is 3, last_sequence is 0, log_number is 0,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/OPTIONS-000005:  wal_dir=/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG.old.1599203351819700:2020/09/04-07:09:11.716714 7fe702591700 SST files in /data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db dir, Total Num: 0, files:
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG.old.1599203351819700:2020/09/04-07:09:11.716716 7fe702591700 Write Ahead Log file in /data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db:
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG.old.1599203351819700:2020/09/04-07:09:11.716737 7fe702591700 Options.wal_dir: /data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG.old.1599203351819700:2020/09/04-07:09:11.739510 7fe702591700 [/version_set.cc:3731] Recovered from manifest file:/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db/MANIFEST-000001 succeeded,manifest_file_number is 1, next_file_number is 3, last_sequence is 0, log_number is 0,prev_log_number is 0,max_column_family is 0,min_log_number_to_keep is 0
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1.container:chunksPath: /data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/chunks
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1.container:metadataPath: /data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata
{code}
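To make the scope of the problem concrete: on top of the directory rename, an upgrade step would also have to rewrite every path embedded in the container metadata. Below is a minimal sketch of such a rewriter; it is hypothetical (not part of this PR) and simply substitutes the clusterID for the old scmId in every .container descriptor and RocksDB OPTIONS file under a volume:
{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/**
 * Hypothetical upgrade helper: after the volume dir has been renamed from
 * scmId to clusterID, rewrite the scmId that is still embedded in the
 * .container descriptors and the RocksDB OPTIONS files.
 */
public final class ScmIdPathRewriter {

  public static void rewrite(Path volumeRoot, String scmId, String clusterId)
      throws IOException {
    try (Stream<Path> files = Files.walk(volumeRoot)) {
      files.filter(Files::isRegularFile)
          .filter(p -> {
            String name = p.getFileName().toString();
            return name.endsWith(".container") || name.startsWith("OPTIONS-");
          })
          .forEach(p -> replaceInFile(p, scmId, clusterId));
    }
  }

  // Naive in-place textual substitution of the old scmId.
  private static void replaceInFile(Path file, String from, String to) {
    try {
      String content =
          new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
      if (content.contains(from)) {
        Files.write(file,
            content.replace(from, to).getBytes(StandardCharsets.UTF_8));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
{code}
Even with such a helper, every container on every volume of every datanode has to be visited and rewritten in place, which is why the rename-only approach cannot work and why the full rewrite is impractical for a huge cluster.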
Without such a rewrite, the datanode keeps resolving chunk paths through the old scmId directory, and writes fail:
{code:java}
dn2_1    | 2020-09-04 07:09:34,649 [RatisApplyTransactionExecutor 1] INFO keyvalue.KeyValueHandler: Operation: WriteChunk , Trace ID: 484180146a7dfd81:e085484a411b0492:484180146a7dfd81:0 , Message: Internal error:  , Result: IO_EXCEPTION , StorageContainerException Occurred.
dn2_1    | org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Internal error:
dn2_1    |      at org.apache.hadoop.ozone.container.keyvalue.impl.FilePerChunkStrategy.writeChunk(FilePerChunkStrategy.java:181)
dn2_1    |      at org.apache.hadoop.ozone.container.keyvalue.impl.ChunkManagerDispatcher.writeChunk(ChunkManagerDispatcher.java:70)
dn2_1    |      at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleWriteChunk(KeyValueHandler.java:712)
dn2_1    |      at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:191)
dn2_1    |      at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:155)
dn2_1    |      at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:304)
dn2_1    |      at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:166)
dn2_1    |      at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:400)
dn2_1    |      at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.runCommand(ContainerStateMachine.java:410)
dn2_1    |      at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$applyTransaction$42(ContainerStateMachine.java:754)
dn2_1    |      at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
dn2_1    |      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
dn2_1    |      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
dn2_1    |      at java.base/java.lang.Thread.run(Thread.java:834)
dn2_1    | Caused by: java.nio.file.NoSuchFileException: /data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/chunks/104805390775943168_chunk_1.tmp.1.1
dn2_1    |      at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
dn2_1    |      at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
dn2_1    |      at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
dn2_1    |      at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:430)
dn2_1    |      at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
dn2_1    |      at java.base/java.nio.file.Files.move(Files.java:1421)
dn2_1    |      at org.apache.hadoop.ozone.container.keyvalue.impl.FilePerChunkStrategy.commitChunk(FilePerChunkStrategy.java:366)
dn2_1    |      at org.apache.hadoop.ozone.container.keyvalue.impl.FilePerChunkStrategy.writeChunk(FilePerChunkStrategy.java:165)
dn2_1    |      ... 13 more
dn2_1    | 2020-09-04 07:09:34,666 [RatisApplyTransactionExecutor 1] WARN keyvalue.KeyValueHandler: Unexpected error while marking container 1 as unhealthy
{code}
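For reference, this failure mode is easy to reproduce in isolation: Files.move raises NoSuchFileException as soon as one end of the move sits under a directory that no longer exists, which is exactly the situation after the rename. A tiny self-contained sketch (the stale path is copied from the log above; the class name is illustrative):
{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class StaleChunksPathDemo {
  public static void main(String[] args) throws Exception {
    // The chunksPath recorded in 1.container still points at the old
    // scmId dir, which was renamed away to the clusterID.
    Path staleChunksDir = Paths.get(
        "/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914",
        "current/containerDir0/1/chunks");
    Path tmp = Files.createTempFile("chunk", ".tmp");
    // Committing a chunk ends in an atomic move into chunksPath (see
    // FilePerChunkStrategy.commitChunk in the trace); with the directory
    // gone, this throws java.nio.file.NoSuchFileException.
    Files.move(tmp, staleChunksDir.resolve("104805390775943168_chunk_1"),
        StandardCopyOption.ATOMIC_MOVE);
  }
}
{code}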
 

> replace scmID with clusterID for container and volume at Datanode side
> ----------------------------------------------------------------------
>
>                 Key: HDDS-4107
>                 URL: https://issues.apache.org/jira/browse/HDDS-4107
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: SCM
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Major
>              Labels: backward-incompatible, pull-request-available, upgrade
>
> The disk layout per volume is as follows:
> {code:java}
> ../hdds/VERSION
> ../hdds/<<scmUuid>>/current/<<containerDir>>/<<containerID>>/metadata
> ../hdds/<<scmUuid>>/current/<<containerDir>>/<<containerID>>/<<dataDir>>{code}
> However, after SCM-HA is enabled, a typical SCM group will consist of 3
> SCMs; each SCM has its own scmUuid, while all of them share the same
> clusterID.
> Since federation is not supported yet and only one cluster is supported
> now, this Jira will change scmID to clusterID for container and volume at
> the Datanode side.
> The disk layout after the change will be as follows:
> {code:java}
> ../hdds/VERSION
> ../hdds/<<clusterID>>/current/<<containerDir>>/<<containerID>>/metadata
> ../hdds/<<clusterID>>/current/<<containerDir>>/<<containerID>>/<<dataDir>>{code}


