[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484237#comment-17484237
 ] 

Hudson commented on HBASE-26679:


Results for branch branch-2.5
[build #38 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.5/38/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.5/38/General_20Nightly_20Build_20Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.5/38/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.5/38/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Critical
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.10
>
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.
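
For illustration, here is a minimal sketch of the fix idea described above, reusing the fields from the quoted method. It is not the committed patch, and the helper name {{failWaitingAckQueue}} is purely illustrative:

{code:java}
// Sketch only: even when state is already BROKEN or CLOSED, still fail the
// callbacks left in waitingAckQueue so a flush() future cannot hang forever.
private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
  if (state == State.BROKEN || state == State.CLOSED) {
    // Previously we returned without touching waitingAckQueue, which is what
    // left the flush future uncompleted; now we still fail pending callbacks.
    failWaitingAckQueue(channel, errorSupplier);
    return;
  }
  // disable further write, and fail all pending ack.
  state = State.BROKEN;
  failWaitingAckQueue(channel, errorSupplier);
  datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
}

// Illustrative helper mirroring the loop at lines 247-261 above: find the first
// callback still waiting on this channel and fail it and every callback after it.
private void failWaitingAckQueue(Channel channel, Supplier<Throwable> errorSupplier) {
  Throwable error = errorSupplier.get();
  for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
    Callback c = iter.next();
    if (!c.unfinishedReplicas.contains(channel.id())) {
      continue;
    }
    for (;;) {
      c.future.completeExceptionally(error);
      if (!iter.hasNext()) {
        break;
      }
      c = iter.next();
    }
    break;
  }
}
{code}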



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484164#comment-17484164
 ] 

Hudson commented on HBASE-26679:


Results for branch master
[build #506 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/master/506/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/master/506/General_20Nightly_20Build_20Report/]






(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/master/506/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/master/506/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 source release artifact{color}
-- See build output for details.


(x) {color:red}-1 client integration test{color}
-- Something went wrong with this stage, [check relevant console 
output|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/master/506//console].


> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Critical
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.10
>
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-28 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484048#comment-17484048
 ] 

Hudson commented on HBASE-26679:


Results for branch branch-2.4
[build #279 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/279/]:
 (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/279/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/279/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/279/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/279/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Critical
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.10
>
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-28 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483892#comment-17483892
 ] 

Hudson commented on HBASE-26679:


Results for branch branch-2
[build #452 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/452/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/452/General_20Nightly_20Build_20Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/452/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/452/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/452/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Critical
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10
>
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-28 Thread chenglei (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483764#comment-17483764
 ] 

chenglei commented on HBASE-26679:
--

[~zhangduo], thank you very much for your help.

> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Critical
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10
>
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-28 Thread chenglei (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483650#comment-17483650
 ] 

chenglei commented on HBASE-26679:
--

[~zhangduo], OK, I will try to make a PR for branch-2.4.

> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10
>
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-28 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483641#comment-17483641
 ] 

Duo Zhang commented on HBASE-26679:
---

[~comnetwork] Could you please open a PR for branch-2.4? We do not have datanodeInfoMap 
on branch-2.4, so the test changes are not straightforward...

> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10
>
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-27 Thread Huaxiang Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483364#comment-17483364
 ] 

Huaxiang Sun commented on HBASE-26679:
--

Very nice finding and analysis! We ran into this issue before, and Stack has the 
related jira, HBASE-26041, linked here.

> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10
>
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-18 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478309#comment-17478309
 ] 

Lijin Bin commented on HBASE-26679:
---

Looks like HBASE-26411 is exactly this problem.

> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-18 Thread chenglei (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477864#comment-17477864
 ] 

chenglei commented on HBASE-26679:
--

??But looking at the code, I do not think it can only be reproduced by the above 
scenario; as long as a DN responds faster than the others and then fails, we can 
run into this situation and cause some future to be stuck forever.??
Yes, I just took this scenario to illustrate the problem.

> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26679) Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck

2022-01-18 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477834#comment-17477834
 ] 

Duo Zhang commented on HBASE-26679:
---

Ah, this is a problem. FanOutOneBlockAsyncDFSOutput.failed can be invoked 
without an outgoing packet for a DN, as we could also call it in 
channelInactive.

But looking at the code, I do not think it can only be reproduced by the above 
scenario; as long as a DN responds faster than the others and then fails, we can 
run into this situation and cause some future to be stuck forever.

Let me take a look at your PR.

Thanks for digging.

> Wait on the future returned by FanOutOneBlockAsyncDFSOutput.flush would stuck
> -
>
> Key: HBASE-26679
> URL: https://issues.apache.org/jira/browse/HBASE-26679
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
>
> Consider three datanodes dn1, dn2, and dn3: we write some data to {{FanOutOneBlockAsyncDFSOutput}} and then flush it, so there is one {{Callback}} in {{FanOutOneBlockAsyncDFSOutput.waitingAckQueue}}. If the ack from dn1 arrives first, Netty invokes {{FanOutOneBlockAsyncDFSOutput.completed}} with dn1's channel, and in {{FanOutOneBlockAsyncDFSOutput.completed}} dn1's channel is removed from {{Callback.unfinishedReplicas}}.
> But dn2 and dn3 respond slowly, and before they send their acks, dn1 is shut down or hits an exception, so Netty triggers {{FanOutOneBlockAsyncDFSOutput.failed}} with dn1's channel. Because {{Callback.unfinishedReplicas}} no longer contains dn1's channel, the {{Callback}} is skipped in the {{FanOutOneBlockAsyncDFSOutput.failed}} method (line 250 below), and at line 245 {{FanOutOneBlockAsyncDFSOutput.state}} is set to {{State.BROKEN}}.
> {code:java}
> 233  private synchronized void failed(Channel channel, Supplier<Throwable> errorSupplier) {
> 234    if (state == State.BROKEN || state == State.CLOSED) {
> 235      return;
> 236    }
>
> 244    // disable further write, and fail all pending ack.
> 245    state = State.BROKEN;
> 246    Throwable error = errorSupplier.get();
> 247    for (Iterator<Callback> iter = waitingAckQueue.iterator(); iter.hasNext();) {
> 248      Callback c = iter.next();
> 249      // find the first sync request which we have not acked yet and fail all the request after it.
> 250      if (!c.unfinishedReplicas.contains(channel.id())) {
> 251        continue;
> 252      }
> 253      for (;;) {
> 254        c.future.completeExceptionally(error);
> 255        if (!iter.hasNext()) {
> 256          break;
> 257        }
> 258        c = iter.next();
> 259      }
> 260      break;
> 261    }
> 262    datanodeInfoMap.keySet().forEach(ChannelOutboundInvoker::close);
> 263  }
> {code}
> At the end of the method, at line 262, the channels to dn1, dn2, and dn3 are all closed, so {{FanOutOneBlockAsyncDFSOutput.failed}} is triggered again for dn2 and dn3, but at line 234, because {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, the whole {{FanOutOneBlockAsyncDFSOutput.failed}} method is skipped. So waiting on the future returned by {{FanOutOneBlockAsyncDFSOutput.flush}} would get stuck forever.
> When we roll the WAL, we create a new {{FanOutOneBlockAsyncDFSOutput}} and a new {{AsyncProtobufLogWriter}}; in {{AsyncProtobufLogWriter.init}} we write the WAL header to the {{FanOutOneBlockAsyncDFSOutput}} and wait for it to complete. If we run into this situation, the roll would hang forever.
> I have simulated this case in the PR, and my fix is: even though {{FanOutOneBlockAsyncDFSOutput.state}} is already {{State.BROKEN}}, we still try to complete {{Callback.future}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)