[ https://issues.apache.org/jira/browse/HDDS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743995#comment-17743995 ]
Ethan Rose edited comment on HDDS-9022 at 7/17/23 10:21 PM:
------------------------------------------------------------

I believe I found the problem. The disk check is incorrectly started right when the datanode starts up, without waiting for the version endpoint task and volume checks to finish initializing the volumes. This differs from the container scanner, whose startup is correctly delayed until the version endpoint task completes.

The log shows volume checks being scheduled before the endpoint task completes:
{code:java}
$ grep -e 'ERROR.*RunningDatanodeState' -e 'Scheduling a check for' *.log
2023-07-15 10:06:38,026 [main] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
2023-07-15 10:06:38,039 [main] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode
2023-07-15 10:06:43,544 [Datanode State Machine Daemon Thread] ERROR org.apache.hadoop.ozone.container.common.states.datanode.RunningDatanodeState: Error in executing end point task.
2023-07-15 10:06:53,914 [Datanode State Machine Daemon Thread] ERROR org.apache.hadoop.ozone.container.common.states.datanode.RunningDatanodeState: Error in executing end point task.
{code}

The bug looks to have existed since HDDS-5268, but it is only showing up now, after HDDS-8782, because previously the volume checks only did trivial checks on the top-level volume directory. Reproducing it requires a perfect set of events, which seems to have happened here:
* The datanode takes longer to connect to SCM when it is started for the very first time.
** This gives the volume scanner a chance to run before the endpoint task completes volume initialization.
** This is likely why the issue could not be reproduced in docker.
* The datanode working directory is not writeable.
** When the [parent directory of the disk check files is null|https://github.com/apache/ozone/blob/153659032bcdc177f7a0e0afc6ff97c06fc2a76a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java#L134C1-L134C1] because the volume is not initialized, the disk check file becomes a relative path and is written to the datanode working directory instead.
** If the datanode can write to this directory, the error would probably be silent and correct itself once volume initialization finishes.
* The disk check gap is set to 0 and the disk check interval is small, so volume scans are scheduled immediately on startup.

The fix should be to avoid starting the volume checks in the [OzoneContainer constructor|https://github.com/apache/ozone/blob/2242f5ca91ac802f3d5ddaca1c9f7389012f8227/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java#L144] and instead move them to [OzoneContainer#start|https://github.com/apache/ozone/blob/2242f5ca91ac802f3d5ddaca1c9f7389012f8227/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java#L400] like the container scanners, which is called [after the version endpoint task finishes|https://github.com/apache/ozone/blob/956385a4496a634bd3f5232e7201cc9c1de8db22/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/endpoint/VersionEndpointTask.java#L88]. I will have a patch for this soon.

> DiskChecker incorrectly reporting errors
> ----------------------------------------
>
>                 Key: HDDS-9022
>                 URL: https://issues.apache.org/jira/browse/HDDS-9022
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Mladjan Gadzic
>            Assignee: Ethan Rose
>            Priority: Major
>        Attachments: dn1.log
>
> During a load test of an AWS-based Ozone cluster, we see that datanodes are shutting down. We believe this is caused by the new DiskChecker incorrectly reporting errors.
>
> It happens because StorageVolume runs the check method without initializing the diskCheckDir field.
>
> Thanks [~xBis] for the research and notes on it!
> Diff showing new log messages:
> {code:java}
> diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> index b267b1d47..1cb0d0085 100644
> --- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> +++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
> @@ -37,6 +37,8 @@
>   * where the disk is mounted.
>   */
>  public final class DiskCheckUtil {
> +  private static final Logger LOG =
> +      LoggerFactory.getLogger(DiskCheckUtil.class);
>    private DiskCheckUtil() { }
>
>    // For testing purposes, an alternate check implementation can be provided
> @@ -63,6 +65,20 @@ public static boolean checkPermissions(File storageDir) {
>
>    public static boolean checkReadWrite(File storageDir, File testFileDir,
>        int numBytesToWrite) {
> +    if (storageDir == null) {
> +      LOG.info("###storageDir is null. Printing stack trace: {}",
> +          Arrays.toString(new NullPointerException().getStackTrace()));
> +    } else {
> +      LOG.info("###storageDir path={}", storageDir.getPath());
> +    }
> +
> +    if (testFileDir == null) {
> +      LOG.info("###testFileDir is null. Printing stack trace: {}",
> +          Arrays.toString(new NullPointerException().getStackTrace()));
> +    } else {
> +      LOG.info("###testFileDir path={}", testFileDir.getPath());
> +    }
> +
>      return impl.checkReadWrite(storageDir, testFileDir, numBytesToWrite);
>    }
> {code}
>
> Stacktrace (specifically look at lines with "###"):
> {code:java}
> 2023-07-15 10:07:38,006 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir path=/hadoop/ozone/data/disk1/datanode/data/hdds
> 2023-07-15 10:07:38,007 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir path=/hadoop/ozone/data/disk1/datanode/data/hdds/CID-2eb5a782-379b-46c7-8bd1-8b19043c1a6e/tmp/disk-check
> 2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker: Scheduling a check for /hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,010 [Periodic HDDS volume checker] INFO org.apache.hadoop.ozone.container.common.volume.StorageVolumeChecker: Scheduled health check for volume /hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,010 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###storageDir path=/hadoop/ozone/data/disk1/datanode
> 2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] INFO org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: ###testFileDir is null. Printing stack trace: [org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:76), org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629), org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68), org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143), com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131), com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75), com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624), java.lang.Thread.run(Thread.java:750)]
> 2023-07-15 10:07:38,011 [DataNode DiskChecker thread 0] ERROR org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil: Volume /hadoop/ozone/data/disk1/datanode failed health check. Could not find file disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb for volume check.
> java.io.FileNotFoundException: disk-check-c8f5b40b-0cf6-420e-bbc3-9fb59def11eb (No such file or directory)
>     at java.io.FileOutputStream.open0(Native Method)
>     at java.io.FileOutputStream.open(FileOutputStream.java:270)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>     at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil$DiskChecksImpl.checkReadWrite(DiskCheckUtil.java:153)
>     at org.apache.hadoop.ozone.container.common.utils.DiskCheckUtil.checkReadWrite(DiskCheckUtil.java:82)
>     at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:629)
>     at org.apache.hadoop.ozone.container.common.volume.StorageVolume.check(StorageVolume.java:68)
>     at org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker.lambda$schedule$0(ThrottledAsyncChecker.java:143)
>     at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
>     at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
>     at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {code}
>
> The bug could not be reproduced using a Docker cluster, only on a real cluster. Steps to reproduce it:
> # Add/modify properties as follows:
> {noformat}
> OZONE-SITE.XML_hdds.datanode.periodic.disk.check.interval.minutes=1
> OZONE-SITE.XML_hdds.datanode.disk.check.min.gap=0s{noformat}
> # Wait for the volume to fail its health check.
>
> Check the attached log file for more context.
> Suggested workaround diff:
> {code:java}
> diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> index 95d1b2c2d..634b15a8e 100644
> --- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> +++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
> @@ -624,6 +624,10 @@ public synchronized VolumeCheckResult check(@Nullable Boolean unused)
>        return VolumeCheckResult.HEALTHY;
>      }
>
> +    if (diskCheckDir == null) {
> +      diskCheckDir = storageDir;
> +    }
> +
>      // Since IO errors may be intermittent, volume remains healthy until the
>      // threshold of failures is crossed.
>      boolean diskChecksPassed = DiskCheckUtil.checkReadWrite(storageDir,
> {code}
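As an aside, the relative-path failure mode described in the comment above is easy to demonstrate outside Ozone. The sketch below (hypothetical class and file names, not the actual Ozone code) shows how {{java.io.File}} with a null parent falls back to the JVM working directory:

```java
import java.io.File;

public class NullParentDemo {
    public static void main(String[] args) {
        // Mimics diskCheckDir before the volume has been initialized.
        File uninitializedDiskCheckDir = null;

        // File(File parent, String child) with a null parent behaves like
        // File(String child): the result is a relative path resolved against
        // the JVM working directory, not the volume.
        File checkFile = new File(uninitializedDiskCheckDir, "disk-check-test");

        System.out.println(checkFile.isAbsolute());      // false
        System.out.println(checkFile.getAbsolutePath()); // resolves under user.dir
    }
}
```

If the datanode's working directory is not writeable, opening a {{FileOutputStream}} on such a relative path fails with the {{FileNotFoundException}} seen in the stack trace above.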
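The proposed fix (moving the checker start out of the constructor) can be sketched in miniature. The class and method names below are illustrative only, not the real Ozone classes; the point is that deferring scheduling to {{start()}} removes the race with volume initialization:

```java
// Hypothetical sketch of the fix: schedule the periodic volume checker in
// start() instead of the constructor, so it cannot run before the version
// endpoint task has initialized the volumes.
class ContainerSketch {
    private final Runnable volumeChecker;
    private boolean started = false;

    ContainerSketch(Runnable volumeChecker) {
        // Before the fix, the checker was scheduled here, racing the
        // version endpoint task that initializes the volumes.
        this.volumeChecker = volumeChecker;
    }

    // Called only after the version endpoint task finishes, like the
    // container scanners, so the volumes are initialized by the time
    // the first disk check runs.
    synchronized void start() {
        if (!started) {
            started = true;
            volumeChecker.run();
        }
    }
}

public class FixSketchDemo {
    public static void main(String[] args) {
        ContainerSketch container = new ContainerSketch(
            () -> System.out.println("volume check scheduled"));
        // Constructing the container schedules nothing; only start() does.
        container.start();
    }
}
```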