duongkame opened a new pull request, #7240:
URL: https://github.com/apache/ozone/pull/7240

   ## What changes were proposed in this pull request?
   See HDDS-11485.
   
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-11485
   
   ## How was this patch tested?
   
   Tested locally with the following scenario:
   1. Create a test docker cluster with 3 datanodes, each datanode having a volume configured at `/data/hdds/hdds`.
   2. On one datanode, remove the database folder under the volume: `/data/hdds/hdds/CID-<cluster id>/DS-<storage id>`. Then restart the datanode.
   
   Result:
   Before this fix, the behavior is exactly as described in HDDS-11485. The datanode loads the volume and detects the missing database folder, but does not report the volume as unhealthy. When a client writes data to the datanode, the Ratis log crashes with an NPE.
   
   After this fix, when the datanode loads the volume and detects the missing database folder, it reports the volume as unhealthy immediately. See the datanode logs:
   ```
   2024-09-26 15:44:41 2024-09-26 22:44:41,109 
[ForkJoinPool.commonPool-worker-19] INFO volume.ThrottledAsyncChecker: 
Scheduling a check for /data/hdds/hdds
   2024-09-26 15:44:41 2024-09-26 22:44:41,127 
[ForkJoinPool.commonPool-worker-19] ERROR ozoneimpl.OzoneContainer: Load db 
store for HddsVolume /data/hdds/hdds failed
   2024-09-26 15:44:41 java.io.IOException: Db parent dir 
/data/hdds/hdds/CID-730e257b-8168-41e8-b7e4-635a26af2f9b/DS-f053f5a7-efff-479a-8cef-e5d8b2ab852f
 not found for HddsVolume: /data/hdds/hdds
   2024-09-26 15:44:41     at 
org.apache.hadoop.ozone.container.common.volume.HddsVolume.loadDbStore(HddsVolume.java:370)
   2024-09-26 15:44:41     at 
org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.loadVolume(HddsVolumeUtil.java:111)
   2024-09-26 15:44:41     at 
org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.lambda$loadAllHddsVolumeDbStore$0(HddsVolumeUtil.java:97)
   2024-09-26 15:44:41     at 
java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
   2024-09-26 15:44:41     at 
java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1728)
   2024-09-26 15:44:41     at 
java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
   2024-09-26 15:44:41     at 
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
   2024-09-26 15:44:41     at 
java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
   2024-09-26 15:44:41     at 
java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
   2024-09-26 15:44:41     at 
java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
   2024-09-26 15:44:41 2024-09-26 22:44:41,129 [main] INFO 
ozoneimpl.OzoneContainer: Load 1 volumes DbStore cost: 22ms
   2024-09-26 15:44:41 2024-09-26 22:44:41,135 
[58680955-2588-4bda-9101-0aabff9ec06c-DataNodeDiskCheckerThread-0] WARN 
volume.HddsVolume: Volume /data/hdds/hdds failed to access RocksDB: RocksDB 
parent directory is null, the volume might not have been loaded properly.
   2024-09-26 15:44:41 2024-09-26 22:44:41,138 
[58680955-2588-4bda-9101-0aabff9ec06c-VolumeCheckResultHandlerThread-0] WARN 
volume.StorageVolumeChecker: Volume /data/hdds/hdds detected as being unhealthy
   2024-09-26 15:44:41 2024-09-26 22:44:41,139 
[58680955-2588-4bda-9101-0aabff9ec06c-VolumeCheckResultHandlerThread-0] WARN 
volume.MutableVolumeSet: checkVolumeAsync callback got 1 failed volumes: 
[/data/hdds/hdds]
   2024-09-26 15:44:41 2024-09-26 22:44:41,142 
[58680955-2588-4bda-9101-0aabff9ec06c-VolumeCheckResultHandlerThread-0] INFO 
volume.MutableVolumeSet: Moving Volume : /data/hdds/hdds to failed Volumes
   2024-09-26 15:44:41 2024-09-26 22:44:41,142 
[58680955-2588-4bda-9101-0aabff9ec06c-VolumeCheckResultHandlerThread-0] ERROR 
volume.MutableVolumeSet: Not enough volumes in MutableVolumeSet. DatanodeUUID: 
58680955-2588-4bda-9101-0aabff9ec06c, VolumeType: DATA_VOLUME, 
MaxVolumeFailuresTolerated: -1, ActiveVolumes: 0, FailedVolumes: 1
   2024-09-26 15:44:41 2024-09-26 22:44:41,142 
[58680955-2588-4bda-9101-0aabff9ec06c-VolumeCheckResultHandlerThread-0] ERROR 
statemachine.DatanodeStateMachine: DatanodeStateMachine Shutdown due to too 
many bad volumes, check hdds.datanode.failed.data.volumes.tolerated and 
hdds.datanode.failed.metadata.volumes.tolerated and 
hdds.datanode.failed.db.volumes.tolerated
   ```
   In the logs above, the unhealthy volume is also escalated to a datanode 
failure because it's the datanode's only volume.
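   The intended behavior can be illustrated with a minimal, self-contained sketch. This is not the actual `HddsVolume` code; the class, enum, and method names below are hypothetical stand-ins showing the general idea: a volume whose DB parent directory is missing fails the health check instead of passing silently.

   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;

   public class VolumeDbCheckSketch {
     enum VolumeCheckResult { HEALTHY, FAILED }

     // Report FAILED when the expected db parent directory
     // (<volume>/<cluster dir>/<storage dir>) does not exist.
     static VolumeCheckResult checkDbStore(Path volumeRoot,
         String clusterDir, String storageDir) {
       Path dbParent = volumeRoot.resolve(clusterDir).resolve(storageDir);
       return Files.isDirectory(dbParent)
           ? VolumeCheckResult.HEALTHY
           : VolumeCheckResult.FAILED;
     }

     public static void main(String[] args) throws IOException {
       Path volume = Files.createTempDirectory("hdds-volume");
       // DB parent dir missing: the check fails instead of passing silently.
       System.out.println(checkDbStore(volume, "CID-test", "DS-test")); // FAILED
       Files.createDirectories(volume.resolve("CID-test").resolve("DS-test"));
       System.out.println(checkDbStore(volume, "CID-test", "DS-test")); // HEALTHY
     }
   }
   ```

   In the real datanode, a FAILED result from the volume checker is what lets `StorageVolumeChecker` and `MutableVolumeSet` move the volume to the failed set, as shown in the logs above.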


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

