[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r458571774

## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java

@@ -434,9 +438,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
     if (isThreadPoolAvailable(service)) {
       task.execute(service);
+      threadPoolNotAvailableCount.set(0);
     } else {
-      LOG.warn("No available thread in pool, duration:" +
-          unit.toMillis(time));
+      threadPoolNotAvailableCount.incrementAndGet();

Review comment:
    Updated

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
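The hunk above resets `threadPoolNotAvailableCount` whenever a task is submitted and increments it otherwise, which enables throttled warnings instead of one log line per rejection. A minimal standalone sketch of that counter pattern follows; the `LOG_EVERY_N` threshold, class name, and `trySubmit` helper are assumptions for illustration, not code from the PR.

```java
import java.util.concurrent.atomic.AtomicLong;

public class PoolAvailabilityCounter {
  // Assumed throttle interval; the PR's actual logging policy may differ.
  private static final long LOG_EVERY_N = 10;
  private final AtomicLong threadPoolNotAvailableCount = new AtomicLong();

  /** Returns true when the task could be submitted, false when the pool was busy. */
  public boolean trySubmit(boolean poolAvailable) {
    if (poolAvailable) {
      threadPoolNotAvailableCount.set(0);   // pool recovered: reset the streak
      return true;
    }
    long misses = threadPoolNotAvailableCount.incrementAndGet();
    if (misses % LOG_EVERY_N == 1) {        // warn on the 1st, 11th, 21st miss...
      System.err.println("No available thread in pool, consecutive misses: " + misses);
    }
    return false;
  }

  public long getMisses() {
    return threadPoolNotAvailableCount.get();
  }
}
```

Resetting on success means the counter tracks the length of the current unavailable streak, so a transient blip logs once while a stuck pool keeps surfacing periodically.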
[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r457074883

## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java

@@ -415,7 +431,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
     if (this.isEntering()) {
       task.onEnter();
     }
-    task.execute(service);
+
+    if (isThreadPoolAvailable(service)) {
+      task.execute(service);
+    } else {
+      LOG.warn("No available thread in pool, duration:" +

Review comment:
    @ChenSammi @xiaoyuyao Thanks for the review. I have updated the PR. Because EndpointStateMachine#logIfNeeded cannot be accessed here, and it cannot easily be refactored for reuse, I did not use it.
[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r456195255

## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java

@@ -415,7 +431,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
     if (this.isEntering()) {
       task.onEnter();
     }
-    task.execute(service);
+
+    if (isThreadPoolAvailable(service)) {
+      task.execute(service);
+    } else {
+      LOG.warn("No available thread in pool, duration:" +

Review comment:
    @xiaoyuyao Thanks for the review and the reasonable suggestion. How about adding metrics for the number of times the thread pool was available and unavailable?
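The hunk guards `task.execute(service)` behind an `isThreadPoolAvailable(service)` check, but that method's body is not shown in this thread. One plausible way to write such a check, sketched here purely as an assumption (the PR's real implementation may differ), is to inspect a `ThreadPoolExecutor`'s active-thread count and queue capacity:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;

public final class PoolCheck {
  private PoolCheck() { }

  /**
   * Hypothetical availability check: true if the pool has a free worker
   * or spare queue capacity; for non-ThreadPoolExecutor services we
   * cannot inspect the state, so we optimistically return true.
   */
  public static boolean isThreadPoolAvailable(ExecutorService service) {
    if (!(service instanceof ThreadPoolExecutor)) {
      return true;
    }
    ThreadPoolExecutor tpe = (ThreadPoolExecutor) service;
    return tpe.getActiveCount() < tpe.getMaximumPoolSize()
        || tpe.getQueue().remainingCapacity() > 0;
  }
}
```

Skipping submission when the pool is saturated is what prevents the leak in the PR title: tasks no longer pile up behind heartbeat threads blocked on a dead SCM.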
[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r454900546

## File path: hadoop-hdds/common/src/main/resources/ozone-default.xml

@@ -268,6 +268,13 @@
     datanode periodically send node report to SCM. Unit could be defined
     with postfix (ns,ms,s,m,h,d)
+  <property>
+    <name>hdds.statemachine.endpoint.task.thread.count</name>
+    <value>2</value>
+    <tag>OZONE, DATANODE, MANAGEMENT</tag>
+    <description>Maximum number of threads in the thread pool that Datanode
+      will use for GETVERSION, REGISTER, HEARTBEAT to SCMs.
+    </description>
+  </property>

Review comment:
    I removed this config, and instead derive the thread pool size by adding the number of SCMs and Recon instances.
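The comment replaces the fixed config value with a size derived from the cluster topology: one endpoint task thread per SCM plus one per Recon instance, so every endpoint can run its GETVERSION/REGISTER/HEARTBEAT task concurrently. A trivial sketch of that derivation (the class and method names are hypothetical, not from the PR):

```java
import java.util.List;

public final class EndpointPoolSize {
  private EndpointPoolSize() { }

  /**
   * One endpoint task thread per SCM address plus one per Recon address,
   * so no endpoint's heartbeat is starved by another endpoint's blocked call.
   */
  public static int poolSize(List<String> scmAddresses, List<String> reconAddresses) {
    return scmAddresses.size() + reconAddresses.size();
  }
}
```

Sizing from topology avoids a misconfigured fixed count: a pool smaller than the endpoint count would let one unreachable SCM block heartbeats to the healthy ones.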
[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread
runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r454194380

## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/datanode/RunningDatanodeState.java

@@ -200,22 +203,31 @@ public void execute(ExecutorService executor) {
   @Override
   public DatanodeStateMachine.DatanodeStates
       await(long duration, TimeUnit timeUnit)
-      throws InterruptedException, ExecutionException, TimeoutException {
+      throws InterruptedException {
     int count = connectionManager.getValues().size();
     int returned = 0;
-    long timeLeft = timeUnit.toMillis(duration);
+    long durationMS = timeUnit.toMillis(duration);
     long startTime = Time.monotonicNow();
-    List<Future<EndPointStates>> results = new LinkedList<>();
-
-    while (returned < count && timeLeft > 0) {
-      Future<EndPointStates> result =
-          ecs.poll(timeLeft, TimeUnit.MILLISECONDS);
-      if (result != null) {
-        results.add(result);
-        returned++;
+    Set<Future<EndPointStates>> results = new HashSet<>();
+
+    while (returned < count
+        && (durationMS - (Time.monotonicNow() - startTime)) > 0) {
+      for (Future<EndPointStates> future : futures) {
+        if (future != null && future.isDone() && !results.contains(future)) {
+          results.add(future);
+          returned++;
+        }
       }
-      timeLeft = timeLeft - (Time.monotonicNow() - startTime);
+
+      Thread.sleep(durationMS / 10);
     }
+
+    for (Future<EndPointStates> future : futures) {
+      if (future != null && !future.isDone()) {
+        future.cancel(true);

Review comment:
    The future may never finish, because the RPC call can block forever, as we observed in our production cluster.
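The point of `future.cancel(true)` in the hunk is that a task blocked on a hung RPC call never completes on its own, so cancelling with interruption is the only way to reclaim its thread. A self-contained demonstration of that mechanism follows; the sleeping task stands in for the hung RPC call, and the class name and timings are illustrative assumptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CancelBlockedTask {
  /** Submits a task that blocks, cancels it, and reports whether the pool drained. */
  public static boolean runAndCancel() {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Future<?> future = pool.submit(() -> {
      try {
        Thread.sleep(60_000);                // stand-in for an RPC call that never returns
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // cancel(true)'s interrupt lands here
      }
    });
    try {
      Thread.sleep(100);                     // let the task start blocking
      boolean cancelled = !future.isDone() && future.cancel(true);
      pool.shutdown();
      // Without cancel(true), this await would time out: the worker thread
      // would stay blocked and leak, which is the bug the PR fixes.
      return cancelled && pool.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      return false;
    }
  }
}
```

Note that `cancel(true)` only helps when the blocked call is interruptible; an RPC layer that swallows interrupts would still leak the thread, which is why the PR also bounds the pool size.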