[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

2020-07-22 Thread GitBox


runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r458571774



##
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java
##
@@ -434,9 +438,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
 
       if (isThreadPoolAvailable(service)) {
        task.execute(service);
+       threadPoolNotAvailableCount.set(0);
      } else {
-       LOG.warn("No available thread in pool, duration:" +
-           unit.toMillis(time));
+       threadPoolNotAvailableCount.incrementAndGet();

Review comment:
   Updated
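   For context, a minimal sketch of the counter pattern this diff implies: the counter is cleared whenever the pool accepts the task and incremented on each miss, so the warning can be logged only occasionally instead of on every trigger. The class name, method names, and the log-every-N threshold below are illustrative assumptions, not the PR's code.

import java.util.concurrent.atomic.AtomicLong;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper; only the reset/increment-and-throttle pattern is shown.
class ThreadPoolMissTracker {
  private static final Logger LOG =
      LoggerFactory.getLogger(ThreadPoolMissTracker.class);
  private static final long LOG_EVERY_N = 10;    // assumed threshold, not from the PR
  private final AtomicLong threadPoolNotAvailableCount = new AtomicLong(0);

  // Called when the task was actually handed to the pool.
  void onSubmitted() {
    threadPoolNotAvailableCount.set(0);          // a thread was free: clear the streak
  }

  // Called when no worker thread was available; warns on the 1st and then every Nth miss.
  void onMiss(long durationMillis) {
    long misses = threadPoolNotAvailableCount.incrementAndGet();
    if (misses % LOG_EVERY_N == 1) {
      LOG.warn("No available thread in pool for {} consecutive attempts,"
          + " duration: {} ms", misses, durationMillis);
    }
  }
}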








[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

2020-07-19 Thread GitBox


runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r457074883



##
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java
##
@@ -415,7 +431,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
       if (this.isEntering()) {
        task.onEnter();
      }
-      task.execute(service);
+
+      if (isThreadPoolAvailable(service)) {
+       task.execute(service);
+      } else {
+       LOG.warn("No available thread in pool, duration:" +

Review comment:
   @ChenSammi @xiaoyuyao Thanks for the review. I have updated the PR. EndpointStateMachine#logIfNeeded cannot be accessed there and cannot easily be refactored for reuse, so I did not use it.
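   For context, a standalone sketch of the kind of availability gate the diff adds before task submission. The ThreadPoolExecutor introspection shown here is an assumption about how isThreadPoolAvailable could be implemented, not the PR's actual check.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical helper illustrating the gate; not the PR's code.
final class PoolAvailability {

  // Treats the pool as available only if a worker is idle and nothing is queued,
  // so a submitted endpoint task starts immediately instead of piling up.
  static boolean isThreadPoolAvailable(ExecutorService service) {
    if (!(service instanceof ThreadPoolExecutor)) {
      return true;                       // cannot introspect other executor types
    }
    ThreadPoolExecutor pool = (ThreadPoolExecutor) service;
    return pool.getQueue().isEmpty()
        && pool.getActiveCount() < pool.getMaximumPoolSize();
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    Runnable task = () -> System.out.println("endpoint task running");
    if (isThreadPoolAvailable(pool)) {
      pool.execute(task);                // submit only when a worker can pick it up now
    } else {
      System.out.println("No available thread in pool, skipping this round");
    }
    pool.shutdown();
  }
}

   Checking getActiveCount() against getMaximumPoolSize() is racy, but as a best-effort gate a wrong guess only delays the task to the next trigger.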








[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

2020-07-16 Thread GitBox


runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r456195255



##
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java
##
@@ -415,7 +431,14 @@ public void execute(ExecutorService service, long time, TimeUnit unit)
       if (this.isEntering()) {
        task.onEnter();
      }
-      task.execute(service);
+
+      if (isThreadPoolAvailable(service)) {
+       task.execute(service);
+      } else {
+       LOG.warn("No available thread in pool, duration:" +

Review comment:
   @xiaoyuyao Thanks for the review and the reasonable suggestion. How about adding metrics for the number of times the thread pool is available and unavailable?
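   As a rough illustration of that suggestion, a Hadoop Metrics2 source with one counter per outcome might look like the sketch below; the class and metric names are invented for this example and are not taken from the PR.

import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

// Hypothetical metrics source: counts how often the endpoint-task pool was
// available vs. not available when StateContext tried to submit a task.
@Metrics(about = "Datanode state machine thread pool metrics", context = "dfs")
public final class StateContextPoolMetrics {

  @Metric("Times a task was submitted because a pool thread was free")
  private MutableCounterLong threadPoolAvailableCount;

  @Metric("Times submission was skipped because no pool thread was free")
  private MutableCounterLong threadPoolNotAvailableCount;

  private StateContextPoolMetrics() {
  }

  public static StateContextPoolMetrics create() {
    return DefaultMetricsSystem.instance().register(
        "StateContextPoolMetrics", "Datanode state machine thread pool metrics",
        new StateContextPoolMetrics());
  }

  public void incrAvailable() {
    threadPoolAvailableCount.incr();
  }

  public void incrNotAvailable() {
    threadPoolNotAvailableCount.incr();
  }
}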








[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

2020-07-15 Thread GitBox


runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r454900546



##
File path: hadoop-hdds/common/src/main/resources/ozone-default.xml
##
@@ -268,6 +268,13 @@
       datanode periodically send node report to SCM. Unit could be
       defined with postfix (ns,ms,s,m,h,d)
     </description>
+  <property>
+    <name>hdds.statemachine.endpoint.task.thread.count</name>
+    <value>2</value>
+    <tag>OZONE, DATANODE, MANAGEMENT</tag>
+    <description>Maximum number of threads in the thread pool that Datanode
+      will use for GETVERSION, REGISTER, HEARTBEAT to SCMs.</description>
+  </property>

Review comment:
   I removed this config; the thread pool size is now derived by adding the number of SCMs and Recon.
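   A rough sketch of that sizing rule, assuming one endpoint task per configured SCM plus one for Recon when a Recon address is present; the config keys ozone.scm.names and ozone.recon.address are used here for illustration and may not be exactly what the PR reads.

import org.apache.hadoop.conf.Configuration;

// Illustrative only: derive the endpoint-task pool size from the configured
// endpoints instead of a dedicated thread-count key.
final class EndpointThreadPoolSize {

  static int get(Configuration conf) {
    // One endpoint task per configured SCM (assumed to be listed in
    // "ozone.scm.names"), plus one when a Recon address is configured.
    int scmCount = conf.getTrimmedStrings("ozone.scm.names").length;
    int reconCount = conf.getTrimmed("ozone.recon.address") == null ? 0 : 1;
    return Math.max(1, scmCount + reconCount);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("ozone.scm.names", "scm1.example.com,scm2.example.com");
    conf.set("ozone.recon.address", "recon.example.com:9891");
    System.out.println("endpoint task threads = " + get(conf));  // prints 3
  }
}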








[GitHub] [hadoop-ozone] runzhiwang commented on a change in pull request #1185: HDDS-3933. Fix memory leak because of too many Datanode State Machine Thread

2020-07-14 Thread GitBox


runzhiwang commented on a change in pull request #1185:
URL: https://github.com/apache/hadoop-ozone/pull/1185#discussion_r454194380



##
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/datanode/RunningDatanodeState.java
##
@@ -200,22 +203,31 @@ public void execute(ExecutorService executor) {
   @Override
   public DatanodeStateMachine.DatanodeStates
       await(long duration, TimeUnit timeUnit)
-      throws InterruptedException, ExecutionException, TimeoutException {
+      throws InterruptedException {
     int count = connectionManager.getValues().size();
     int returned = 0;
-    long timeLeft = timeUnit.toMillis(duration);
+    long durationMS = timeUnit.toMillis(duration);
     long startTime = Time.monotonicNow();
-    List<Future<EndPointStates>> results = new LinkedList<>();
-
-    while (returned < count && timeLeft > 0) {
-      Future<EndPointStates> result =
-          ecs.poll(timeLeft, TimeUnit.MILLISECONDS);
-      if (result != null) {
-        results.add(result);
-        returned++;
+    Set<Future<EndPointStates>> results = new HashSet<>();
+
+    while (returned < count
+        && (durationMS - (Time.monotonicNow() - startTime)) > 0) {
+      for (Future<EndPointStates> future : futures) {
+        if (future != null && future.isDone() && !results.contains(future)) {
+          results.add(future);
+          returned++;
+        }
       }
-      timeLeft = timeLeft - (Time.monotonicNow() - startTime);
+
+      Thread.sleep(durationMS / 10);
     }
+
+    for (Future<EndPointStates> future : futures) {
+      if (future != null && !future.isDone()) {
+        future.cancel(true);

Review comment:
   The future may never finish, because the RPC call can block forever, as we observed in a production cluster.
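   A self-contained demo of the rationale: a worker blocked in an RPC that never returns is only reclaimed if its future is cancelled with interruption, otherwise the thread leaks on every heartbeat round. The pool size, poll interval, and two-second budget below are arbitrary demo values, not the PR's.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Demo of cancelling tasks that outlive the wait budget; not the PR's code.
public final class CancelStuckTasks {

  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    List<Future<?>> futures = new ArrayList<>();
    futures.add(pool.submit(() -> { }));          // completes immediately
    futures.add(pool.submit(() -> {               // stands in for a blocked RPC call
      try {
        Thread.sleep(Long.MAX_VALUE);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();       // unblocks when cancelled
      }
    }));

    long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
    while (System.nanoTime() < deadline
        && futures.stream().anyMatch(f -> !f.isDone())) {
      Thread.sleep(100);                          // poll for completion within the budget
    }

    for (Future<?> f : futures) {
      if (!f.isDone()) {
        f.cancel(true);                           // interrupt the stuck task
      }
    }
    pool.shutdown();
    System.out.println("stuck tasks cancelled: pool threads are free again");
  }
}

   Note that cancel(true) only helps when the blocked call responds to interruption; a call stuck in uninterruptible native I/O would still pin the thread.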







