[ https://issues.apache.org/jira/browse/HDFS-16671?focusedWorklogId=794399&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-794399 ]
ASF GitHub Bot logged work on HDFS-16671: ----------------------------------------- Author: ASF GitHub Bot Created on: 22/Jul/22 22:17 Start Date: 22/Jul/22 22:17 Worklog Time Spent: 10m Work Description: ayushtkn commented on code in PR #4597: URL: https://github.com/apache/hadoop/pull/4597#discussion_r928012699 ########## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RBFConfigKeys.java: ########## @@ -354,6 +354,10 @@ public class RBFConfigKeys extends CommonConfigurationKeysPublic { NoRouterRpcFairnessPolicyController.class; public static final String DFS_ROUTER_FAIR_HANDLER_COUNT_KEY_PREFIX = FEDERATION_ROUTER_FAIRNESS_PREFIX + "handler.count."; + public static final String DFS_ROUTER_FAIRNESS_ACQUIRE_TIMEOUT = + FEDERATION_ROUTER_FAIRNESS_PREFIX + "acquire.timeout"; + public static final long DFS_ROUTER_FAIRNESS_ACQUIRE_TIMEOUT_MS_DEFAULT = Review Comment: The variable name should be similar to the original config name + ``_DEFAULT``, remove the ``_MS_`` in middle. Should be ``DFS_ROUTER_FAIRNESS_ACQUIRE_TIMEOUT_DEFAULT`` ########## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/fairness/AbstractRouterRpcFairnessPolicyController.java: ########## @@ -42,15 +45,22 @@ /** Hash table to hold semaphore for each configured name service. */ private Map<String, Semaphore> permits; + private long acquireTimeoutMs = DFS_ROUTER_FAIRNESS_ACQUIRE_TIMEOUT_MS_DEFAULT; + public void init(Configuration conf) { this.permits = new HashMap<>(); + long timeoutMs = conf.getTimeDuration(DFS_ROUTER_FAIRNESS_ACQUIRE_TIMEOUT, + DFS_ROUTER_FAIRNESS_ACQUIRE_TIMEOUT_MS_DEFAULT, TimeUnit.MILLISECONDS); + if (timeoutMs >= 0) { + acquireTimeoutMs = timeoutMs; Review Comment: if there is an invalid entry configured and we are moving to using the default value. We should atleast have a warn log. Kind of ``Invalid value -1 configured for dfs.... should be greater than or equal to 0, Using default value of : 1s instead`` something like this ########## hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/fairness/TestRouterRpcFairnessPolicyController.java: ########## @@ -83,6 +87,29 @@ public void testHandlerAllocationPreconfigured() { assertFalse(routerRpcFairnessPolicyController.acquirePermit(CONCURRENT_NS)); } + @Test + public void testAcquireTimeout() { + Configuration conf = createConf(40); + conf.setInt(DFS_ROUTER_FAIR_HANDLER_COUNT_KEY_PREFIX + "ns1", 30); + conf.setTimeDuration(DFS_ROUTER_FAIRNESS_ACQUIRE_TIMEOUT, 100, TimeUnit.MILLISECONDS); + RouterRpcFairnessPolicyController routerRpcFairnessPolicyController = + FederationUtil.newFairnessPolicyController(conf); + + // ns1 should have 30 permits allocated + for (int i = 0; i < 30; i++) { + assertTrue(routerRpcFairnessPolicyController.acquirePermit("ns1")); + } + long acquireBeginTimeMs = Time.monotonicNow(); + assertFalse(routerRpcFairnessPolicyController.acquirePermit("ns1")); + long acquireEndTimeMs = Time.monotonicNow(); + + long acquireTimeMs = acquireEndTimeMs - acquireBeginTimeMs; + + // There are some other operations, so acquireTimeMs >= 100ms. + assertTrue(acquireTimeMs >= 100); + assertTrue(acquireTimeMs < 100 + 50); Review Comment: Is this 50 added as safe margin, kind of that other operations won't take more than 50ms? In that case I doubt this test is gonna go flaky in future, Double check to confirm it isn't gonna cross the safe limit of 50 ever. Issue Time Tracking ------------------- Worklog Id: (was: 794399) Time Spent: 1.5h (was: 1h 20m) > RBF: RouterRpcFairnessPolicyController supports configurable permit acquire > timeout > ----------------------------------------------------------------------------------- > > Key: HDFS-16671 > URL: https://issues.apache.org/jira/browse/HDFS-16671 > Project: Hadoop HDFS > Issue Type: Improvement > Reporter: ZanderXu > Assignee: ZanderXu > Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > RouterRpcFairnessPolicyController supports configurable permit acquire > timeout. Hardcode 1s is very long, and it has causes an incident in our prod > environment when one nameserivce is busy. > And the optimal timeout maybe should be less than p50(avgTime). > And all handlers in RBF is waiting to acquire the permit of the busy ns. > {code:java} > "IPC Server handler 12 on default port 8888" #2370 daemon prio=5 os_prio=0 > tid=? nid=? waiting on condition [?] > java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <?> (a > java.util.concurrent.Semaphore$NonfairSync) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409) > at > org.apache.hadoop.hdfs.server.federation.fairness.AbstractRouterRpcFairnessPolicyController.acquirePermit(AbstractRouterRpcFairnessPolicyController.java:56) > at > org.apache.hadoop.hdfs.server.federation.fairness.DynamicRouterRpcFairnessPolicyController.acquirePermit(DynamicRouterRpcFairnessPolicyController.java:123) > at > org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.acquirePermit(RouterRpcClient.java:1500) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org