[ https://issues.apache.org/jira/browse/HDFS-17102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-17102:
----------------------------------
    Labels: pull-request-available  (was: )

> Timeout encountered when running TestDataNodeOutlierDetectionViaMetrics
> -----------------------------------------------------------------------
>
>                 Key: HDFS-17102
>                 URL: https://issues.apache.org/jira/browse/HDFS-17102
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ConfX
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: reproduce.sh
>
>
> h2. What happened:
> {{TestDataNodeOutlierDetectionViaMetrics}} times out when 
> {{dfs.datanode.peer.metrics.min.outlier.detection.samples}} is set to 0 or a 
> negative value.
> h2. Where's the bug:
> In {{TestDataNodeOutlierDetectionViaMetrics.injectFastNodesSamples}} the test 
> injects several packets into the nodes:
> {noformat}
>       for (int i = 0;
>            i < 2 * peerMetrics.getMinOutlierDetectionSamples();
>            ++i) {
>         peerMetrics.addSendPacketDownstream(
>             nodeName, random.nextInt(FAST_NODE_MAX_LATENCY_MS));
>       }{noformat}
> Similar logic appears in {{injectSlowNodesSamples}}. If 
> {{dfs.datanode.peer.metrics.min.outlier.detection.samples}} is set to 0 or a 
> negative value, the loop bound is not positive, so no samples are injected 
> and the later {{waitFor}} call:
> {noformat}
>     GenericTestUtils.waitFor(new Supplier<Boolean>() {
>       @Override
>       public Boolean get() {
>         return peerMetrics.getOutliers().size() > 0;
>       }
>     }, 500, 100_000);{noformat}
> would keep waiting until the 100-second timeout expires.
> h2. How to reproduce:
> (1) Set {{dfs.datanode.peer.metrics.min.outlier.detection.samples}} to {{0}}
> (2) Run test: 
> {{org.apache.hadoop.hdfs.server.datanode.metrics.TestDataNodeOutlierDetectionViaMetrics#testOutlierIsDetected}}
> h2. Stacktrace:
>  
> {noformat}
> java.util.concurrent.TimeoutException:
> Timed out waiting for condition.
> Thread diagnostics:
> Timestamp: 2023-07-04 04:08:54,535
> "Reference Handler" daemon prio=10 tid=2 runnable
> java.lang.Thread.State: RUNNABLE
>         at java.base@11.0.18/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
>         at java.base@11.0.18/java.lang.ref.Reference.processPendingReferences(Reference.java:241)
>         at java.base@11.0.18/java.lang.ref.Reference$ReferenceHandler.run(Reference.java:213)
> "surefire-forkedjvm-command-thread" daemon prio=5 tid=23 runnable
> java.lang.Thread.State: RUNNABLE
> ...
> {noformat}
> To reproduce easily, run the attached {{reproduce.sh}}.
> We are happy to provide a patch if this issue is confirmed.
>  
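
The failure mode described in the quoted report can be illustrated with a small standalone sketch. It does not use the Hadoop classes: InjectionSketch, injectSamples, MAX_LATENCY_MS and the local waitFor helper are hypothetical stand-ins, and "at least one sample exists" is a simplified substitute for the real outlier check. It only shows that a loop bounded by 2 * minSamples runs zero times when minSamples is 0 or negative, so a polling wait on the result can never succeed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

public class InjectionSketch {
  // Hypothetical stand-in for FAST_NODE_MAX_LATENCY_MS in the real test.
  static final int MAX_LATENCY_MS = 9;

  /** Mirrors the loop bound from the report: 2 * minSamples iterations. */
  static List<Integer> injectSamples(int minSamples, Random random) {
    List<Integer> samples = new ArrayList<>();
    for (int i = 0; i < 2 * minSamples; ++i) {
      samples.add(random.nextInt(MAX_LATENCY_MS));
    }
    return samples;
  }

  /** Simplified polling wait: true if the check passes before the deadline. */
  static boolean waitFor(Supplier<Boolean> check, long intervalMs, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (check.get()) {
        return true;
      }
      Thread.sleep(intervalMs);
    }
    return check.get();
  }

  public static void main(String[] args) throws InterruptedException {
    for (int minSamples : new int[] {-1, 0, 10}) {
      List<Integer> samples = injectSamples(minSamples, new Random(0));
      // Assumption: a non-empty sample list stands in for "outliers detected".
      boolean met = waitFor(() -> !samples.isEmpty(), 10, 100);
      System.out.println("minSamples=" + minSamples
          + " injected=" + samples.size() + " conditionMet=" + met);
    }
  }
}
```

With minSamples set to -1 or 0 the sketch injects nothing and the wait gives up after its short timeout; with 10 it injects 20 samples and the condition is met on the first check.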



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
