[ https://issues.apache.org/jira/browse/HDDS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743995#comment-17743995 ]
Ethan Rose edited comment on HDDS-9022 at 7/17/23 10:21 PM:
------------------------------------------------------------

I believe I found the problem. The disk check is incorrectly started right when the datanode starts up, without waiting for the version endpoint task and volume checks to finish initializing the volumes. This differs from the container scanner, whose startup is correctly delayed until the version endpoint task completes.

The log shows volume checks being scheduled before the endpoint task completes:
{code:java}
$ grep -e 'ERROR.*RunningDatanodeState' -e 'Scheduling a check for' *.log
2023-07-15 10:06:38,026 [main] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:06:38,039 [main] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode
2023-07-15 10:06:43,544 [Datanode State Machine Daemon Thread] ERROR org.apache.hadoop.ozone.container.common.states.datanode.RunningDatanodeState: Error in executing end point task.
2023-07-15 10:06:53,914 [Datanode State Machine Daemon Thread] ERROR org.apache.hadoop.ozone.container.common.states.datanode.RunningDatanodeState: Error in executing end point task.
{code}

The bug looks to have existed since HDDS-5268, but it is only showing up now, after HDDS-8782, because previously the volume checks only did trivial checks on the top-level volume directory. Reproducing it requires a perfect set of events, which seems to have happened here:
* The datanode takes longer to connect to SCM when it is started for the very first time.
** This gives the volume scanner a chance to run before the endpoint task completes volume initialization.
** This is likely why the issue could not be reproduced in docker.
* The datanode working directory is not writeable.
** When the [parent directory of the disk check files is null|https://github.com/apache/ozone/blob/153659032bcdc177f7a0e0afc6ff97c06fc2a76a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java#L134C1-L134C1] because the volume is not initialized, the disk check file becomes a relative path and is written to the datanode working directory instead.
** If the datanode can write to this directory, the error would probably be silent and correct itself once volume initialization finishes.
* The disk check gap is set to 0 and the disk check interval is small, so volume scans are scheduled immediately on startup.

The fix should be to avoid starting the volume checks in the [OzoneContainer constructor|https://github.com/apache/ozone/blob/2242f5ca91ac802f3d5ddaca1c9f7389012f8227/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java#L144] and instead move them to [OzoneContainer#start|https://github.com/apache/ozone/blob/2242f5ca91ac802f3d5ddaca1c9f7389012f8227/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java#L400] like the container scanners, which is called [after the version endpoint task finishes|https://github.com/apache/ozone/blob/956385a4496a634bd3f5232e7201cc9c1de8db22/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/endpoint/VersionEndpointTask.java#L88]. I will have a patch for this soon.

> DiskChecker incorrectly reporting errors
> ----------------------------------------
>
>                 Key: HDDS-9022
>                 URL: https://issues.apache.org/jira/browse/HDDS-9022
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Mladjan Gadzic
>            Assignee: Ethan Rose
>            Priority: Major
>        Attachments: dn1.log
>
> During a load test of an AWS-based Ozone cluster, we see that datanodes are shutting down. We believe this is caused by the new DiskChecker incorrectly reporting errors.
>
> It happens because StorageVolume runs the check method without initializing the diskCheckDir field.
>
> Thanks [~xBis] for the research and notes on it!
> Diff showing new log messages:
> {code:java}
> diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> index b267b1d47..1cb0d0085 100644
> --- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> +++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> @@ -37,6 +37,8 @@
>   * where the disk is mounted.
>   */
>  public final class DiskCheckUtil {
> +  private static final Logger LOG =
> +      LoggerFactory.getLogger(DiskCheckUtil.class);
>    private DiskCheckUtil() { }
>
>    // For testing purposes, an alternate check implementation can be provided
> @@ -63,6 +65,20 @@ public static boolean checkPermissions(File storageDir) {
>
>    public static boolean checkReadWrite(File storageDir, File testFileDir,
>        int numBytesToWrite) {
> +    if (storageDir == null) {
> +      LOG.info("###storageDir is null. Printing stack trace: {}",
> +          Arrays.toString(new NullPointerException().getStackTrace()));
> +    } else {
> +      LOG.info("###storageDir path={}", storageDir.getPath());
> +    }
> +
> +    if (testFileDir == null) {
> +      LOG.info("###testFileDir is null. Printing stack trace: {}",
> +          Arrays.toString(new NullPointerException().getStackTrace()));
> +    } else {
> +      LOG.info("###testFileDir path={}", testFileDir.getPath());
> +    }
> +
>      return impl.checkReadWrite(storageDir, testFileDir, numBytesToWrite);
>    }
> {code}
>
> Stacktrace (specifically look at lines with "###"):
> {code:java}
> 2023-07-15 10:07:38,006 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir path=/hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir path=/hadoop/ozone/data/disk1/datanode/data/hdds/CID-2eb5a782-379b-46c7-8bd1-8b19043c1a6e/tmp/disk-check
> 2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,010 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir path=/hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir is null. Printing stack trace: [org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:76), org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629), org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68), org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143), com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131), com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75), com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624), java.lang.Thread.run(Thread.java:750)]
> 2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] ERROR org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume /hadoop/ozone/data/disk1/datanode failed health check. Could not find file disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb for volume check.
> java.io.FileNotFoundException: disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb (No such file or directory)
>     at java.io.FileOutputStream.open0(Native Method)
>     at java.io.FileOutputStream.open(FileOutputStream.java:270)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>     at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil$DiskChecksImpl.checkReadWrite(DiskCheckUtil.java:153)
>     at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:82)
>     at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629)
>     at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68)
>     at org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143)
>     at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
>     at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
>     at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {code}
>
> The bug could not be reproduced using a Docker cluster, only on a real cluster. Steps to reproduce it:
> # Add/modify properties as follows:
> {noformat}
> OZONE-SITE.XML_hdds.datanode.periodic.disk.check.interval.minutes=1
> OZONE-SITE.XML_hdds.datanode.disk.check.min.gap=0s{noformat}
> # Wait for the volume to fail its health check.
>
> Check the attached log file for more context.
> Suggested workaround diff:
> {code:java}
> diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> index 95d1b2c2d..634b15a8e 100644
> --- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> +++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> @@ -624,6 +624,10 @@ public synchronized VolumeCheckResult check(@Nullable Boolean unused)
>        return VolumeCheckResult.HEALTHY;
>      }
>
> +    if (diskCheckDir == null) {
> +      diskCheckDir = storageDir;
> +    }
> +
>      // Since IO errors may be intermittent, volume remains healthy until the
>      // threshold of failures is crossed.
>      boolean diskChecksPassed = DiskCheckUtil.checkReadWrite(storageDir,
> {code}
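As an aside, the relative-path failure mode described in the comment above is easy to demonstrate outside Ozone. The sketch below (hypothetical class and file names, not the actual Ozone code) shows how {{java.io.File}} with a null parent falls back to the JVM working directory:

```java
import java.io.File;

public class NullParentDemo {
    public static void main(String[] args) {
        // Mimics diskCheckDir before the volume has been initialized.
        File uninitializedDiskCheckDir = null;

        // File(File parent, String child) with a null parent behaves like
        // File(String child): the result is a relative path resolved against
        // the JVM working directory, not the volume.
        File checkFile = new File(uninitializedDiskCheckDir, "disk-check-test");

        System.out.println(checkFile.isAbsolute());      // false
        System.out.println(checkFile.getAbsolutePath()); // resolves under user.dir
    }
}
```

If the datanode's working directory is not writeable, opening a {{FileOutputStream}} on such a relative path fails with the {{FileNotFoundException}} seen in the stack trace above.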
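The proposed fix (moving the checker start out of the constructor) can be sketched in miniature. The class and method names below are illustrative only, not the real Ozone classes; the point is that deferring scheduling to {{start()}} removes the race with volume initialization:

```java
// Hypothetical sketch of the fix: schedule the periodic volume checker in
// start() instead of the constructor, so it cannot run before the version
// endpoint task has initialized the volumes.
class ContainerSketch {
    private final Runnable volumeChecker;
    private boolean started = false;

    ContainerSketch(Runnable volumeChecker) {
        // Before the fix, the checker was scheduled here, racing the
        // version endpoint task that initializes the volumes.
        this.volumeChecker = volumeChecker;
    }

    // Called only after the version endpoint task finishes, like the
    // container scanners, so the volumes are initialized by the time
    // the first disk check runs.
    synchronized void start() {
        if (!started) {
            started = true;
            volumeChecker.run();
        }
    }
}

public class FixSketchDemo {
    public static void main(String[] args) {
        ContainerSketch container = new ContainerSketch(
            () -> System.out.println("volume check scheduled"));
        // Constructing the container schedules nothing; only start() does.
        container.start();
    }
}
```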