[jira] [Comment Edited] (HDFS-14361) SNN will always upload fsimage

2019-03-15 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793450#comment-16793450
 ] 

star edited comment on HDFS-14361 at 3/15/19 8:30 AM:
--

[~brahmareddy], please wait.

Maybe we should reconsider the design doc [^Multiple-Standby-NameNodes_V1.pdf] 
. The following code is from that doc:

 
{code:java}
boolean sendRequest = needCheckpoint || isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
Further more, SNN will send download image request as shown in comments above 
line 429 in 3 cases:
 * rollback request
 * are the checkpointer
 * are outside the quiet period

But from the patch only in later two cases will SNN send download request. I 
think it causes issue HDFS-12248.
{code:java}
boolean sendRequest = isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
I think we should move isPrimaryCheckPointer outside of 'if' block to avoid a 
inconsistent state that there are more than 1 SNN with isPrimaryCheckPointer = 
true, though it will not break anything.
  
 Fix code like this:
{code:java}
boolean sendRequest = needRollbackCheckpoint || isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
Or we should fix the comments.

 


was (Author: starphin):
[~brahmareddy], please wait.

Maybe we should reconsider the design doc [^Multiple-Standby-NameNodes_V1.pdf] 
. The following code is from that doc:

 
{code:java}
boolean sendRequest = needCheckpoint || isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
Further more, SNN will send download image request as shown in comments above 
line 429 in 3 cases:
 * rollback request
 * are the checkpointer
 * are outside the quiet period

But from the patch only in later two cases will SNN send download request. I 
think it causes issue HDFS-12248.
{code:java}
boolean sendRequest = isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
I think we should move isPrimaryCheckPointer outside of 'if' block to avoid a 
inconsistent state that there are more than 1 SNN with isPrimaryCheckPointer = 
true, though it will not break anything.
 
Fix code like this:
{code:java}
boolean sendRequest = needRollbackCheckpoint || isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
Or we should fix the comments.

 

> SNN will always upload fsimage
> --
>
> Key: HDFS-14361
> URL: https://issues.apache.org/jira/browse/HDFS-14361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Priority: Major
> Fix For: 3.2.0
>
>
> Related to -HDFS-12248.-
> {code:java}
> boolean sendRequest = isPrimaryCheckPointer
> || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
> doCheckpoint(sendRequest);
> {code}
> If sendRequest is true, SNN will upload fsimage. But isPrimaryCheckPointer 
> always is true,
> {code:java}
> if (ie == null && ioe == null) {
>   //Update only when response from remote about success or
>   lastUploadTime = monotonicNow();
>   // we are primary if we successfully updated the ANN
>   this.isPrimaryCheckPointer = success;
> }
> {code}
> isPrimaryCheckPointer should be outside the if condition.
> If the ANN update was not successful, then isPrimaryCheckPointer should be 
> set to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14361) SNN will always upload fsimage

2019-03-15 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793450#comment-16793450
 ] 

star edited comment on HDFS-14361 at 3/15/19 8:30 AM:
--

[~brahmareddy], please wait.

Maybe we should reconsider the design doc [^Multiple-Standby-NameNodes_V1.pdf] 
. The following code is from that doc:

 
{code:java}
boolean sendRequest = needCheckpoint || isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
Further more, SNN will send download image request as shown in comments above 
line 429 in 3 cases:
 * rollback request
 * are the checkpointer
 * are outside the quiet period

But from the patch only in later two cases will SNN send download request. I 
think it causes issue HDFS-12248.
{code:java}
boolean sendRequest = isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
I think we should move isPrimaryCheckPointer outside of 'if' block to avoid a 
inconsistent state that there are more than 1 SNN with isPrimaryCheckPointer = 
true, though it will not break anything.
 
Fix code like this:
{code:java}
boolean sendRequest = needRollbackCheckpoint || isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
Or we should fix the comments.

 


was (Author: starphin):
[~brahmareddy], please wait.

Maybe we should reconsider the design doc [^Multiple-Standby-NameNodes_V1.pdf] 
. The following code is from that doc:

 
{code:java}
boolean sendRequest = needCheckpoint || isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
Further more, SNN will send download image request as shown in comments above 
line 429 in 3 cases:
 * rollback request
 * are the checkpointer
 * are outside the quiet period

But from the patch only in later two cases will SNN send download request. I 
think it causes issue HDFS-12248.
  boolean sendRequest = isPrimaryCheckPointer
  || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
I think we should move isPrimaryCheckPointer outside of 'if' block to avoid a 
inconsistent state that there are more than 1 SNN with isPrimaryCheckPointer = 
true, though it will not break anything.

Fix code like this:
boolean sendRequest = needRollbackCheckpoint || isPrimaryCheckPointer
|| secsSinceLastUpload >= checkpointConf.getQuietPeriod();
Thus all SNN will send request everytime rollbackCheckpoint is triggered.

Or we should fix the comments.

 

> SNN will always upload fsimage
> --
>
> Key: HDFS-14361
> URL: https://issues.apache.org/jira/browse/HDFS-14361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Priority: Major
> Fix For: 3.2.0
>
>
> Related to -HDFS-12248.-
> {code:java}
> boolean sendRequest = isPrimaryCheckPointer
> || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
> doCheckpoint(sendRequest);
> {code}
> If sendRequest is true, SNN will upload fsimage. But isPrimaryCheckPointer 
> always is true,
> {code:java}
> if (ie == null && ioe == null) {
>   //Update only when response from remote about success or
>   lastUploadTime = monotonicNow();
>   // we are primary if we successfully updated the ANN
>   this.isPrimaryCheckPointer = success;
> }
> {code}
> isPrimaryCheckPointer should be outside the if condition.
> If the ANN update was not successful, then isPrimaryCheckPointer should be 
> set to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14361) SNN will always upload fsimage

2019-03-15 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793450#comment-16793450
 ] 

star commented on HDFS-14361:
-

[~brahmareddy], please wait.

Maybe we should reconsider the design doc [^Multiple-Standby-NameNodes_V1.pdf] 
. The following code is from that doc:

 
{code:java}
boolean sendRequest = needCheckpoint || isPrimaryCheckPointer
|| secsSinceLast >= checkpointConf.getQuietPeriod();{code}
Further more, SNN will send download image request as shown in comments above 
line 429 in 3 cases:
 * rollback request
 * are the checkpointer
 * are outside the quiet period

But from the patch only in later two cases will SNN send download request. I 
think it causes issue HDFS-12248.
  boolean sendRequest = isPrimaryCheckPointer
  || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
I think we should move isPrimaryCheckPointer outside of 'if' block to avoid a 
inconsistent state that there are more than 1 SNN with isPrimaryCheckPointer = 
true, though it will not break anything.

Fix code like this:
boolean sendRequest = needRollbackCheckpoint || isPrimaryCheckPointer
|| secsSinceLastUpload >= checkpointConf.getQuietPeriod();
Thus all SNN will send request everytime rollbackCheckpoint is triggered.

Or we should fix the comments.

 

> SNN will always upload fsimage
> --
>
> Key: HDFS-14361
> URL: https://issues.apache.org/jira/browse/HDFS-14361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Priority: Major
> Fix For: 3.2.0
>
>
> Related to -HDFS-12248.-
> {code:java}
> boolean sendRequest = isPrimaryCheckPointer
> || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
> doCheckpoint(sendRequest);
> {code}
> If sendRequest is true, SNN will upload fsimage. But isPrimaryCheckPointer 
> always is true,
> {code:java}
> if (ie == null && ioe == null) {
>   //Update only when response from remote about success or
>   lastUploadTime = monotonicNow();
>   // we are primary if we successfully updated the ANN
>   this.isPrimaryCheckPointer = success;
> }
> {code}
> isPrimaryCheckPointer should be outside the if condition.
> If the ANN update was not successful, then isPrimaryCheckPointer should be 
> set to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14361) SNN will always upload fsimage

2019-03-14 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793366#comment-16793366
 ] 

star commented on HDFS-14361:
-

[~hunhun], right.

> SNN will always upload fsimage
> --
>
> Key: HDFS-14361
> URL: https://issues.apache.org/jira/browse/HDFS-14361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Priority: Major
> Fix For: 3.2.0
>
>
> Related to -HDFS-12248.-
> {code:java}
> boolean sendRequest = isPrimaryCheckPointer
> || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
> doCheckpoint(sendRequest);
> {code}
> If sendRequest is true, SNN will upload fsimage. But isPrimaryCheckPointer 
> always is true,
> {code:java}
> if (ie == null && ioe == null) {
>   //Update only when response from remote about success or
>   lastUploadTime = monotonicNow();
>   // we are primary if we successfully updated the ANN
>   this.isPrimaryCheckPointer = success;
> }
> {code}
> isPrimaryCheckPointer should be outside the if condition.
> If the ANN update was not successful, then isPrimaryCheckPointer should be 
> set to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14361) SNN will always upload fsimage

2019-03-14 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793345#comment-16793345
 ] 

star commented on HDFS-14361:
-

_if another SNN send checkpoint request to ANN successfully_, current SNN will 
get a TransferFsImage.OLD_TRANSACTION_ID_FAILURE or 
HttpServletResponse.SC_CONFLICT status, not 
TransferFsImage.TransferResult.SUCCESS.

Var 'success' will still be false with no exception. Then 
_isPrimaryCheckPointer_ will be set false.
{code:java}
boolean success = false;
if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
  success = true;
  //avoid getting the rest of the results - we don't care since we had a 
successful upload
  break;
}
{code}
 

> SNN will always upload fsimage
> --
>
> Key: HDFS-14361
> URL: https://issues.apache.org/jira/browse/HDFS-14361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Priority: Major
> Fix For: 3.2.0
>
>
> Related to -HDFS-12248.-
> {code:java}
> boolean sendRequest = isPrimaryCheckPointer
> || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
> doCheckpoint(sendRequest);
> {code}
> If sendRequest is true, SNN will upload fsimage. But isPrimaryCheckPointer 
> always is true,
> {code:java}
> if (ie == null && ioe == null) {
>   //Update only when response from remote about success or
>   lastUploadTime = monotonicNow();
>   // we are primary if we successfully updated the ANN
>   this.isPrimaryCheckPointer = success;
> }
> {code}
> isPrimaryCheckPointer should be outside the if condition.
> If the ANN update was not successful, then isPrimaryCheckPointer should be 
> set to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14361) SNN will always upload fsimage

2019-03-14 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793261#comment-16793261
 ] 

star edited comment on HDFS-14361 at 3/15/19 6:06 AM:
--

  Let's go back to original design doc 
[HDFS-6440|https://issues.apache.org/jira/secure/attachment/12677453/Multiple-Standby-NameNodes_V1.pdf]
 SNN will get a HttpServletResponse.SC_CONFLICT status when ANN already 
download image file from other SNN.

So it is not a serious issue if all SNN send checkpoint request to ANN.

 Further more, SNN will send download image request as shown in comments 
above line 429 in 3 cases:
 * {color:#808080}rollback request{color}
 * {color:#808080}are the checkpointer{color}
 * {color:#808080} are outside the quiet period{color}

{color:#808080}But from the patch only in later two case will SNN send download 
request. I think it causes issue{color} HDFS-12248.
 
  
{code:java}
if (needCheckpoint) {
  // on all nodes, we build the checkpoint. However, we only ship the 
checkpoint if have a
  // rollback request, are the checkpointer, are outside the quiet period.
  final long secsSinceLastUpload = (now - lastUploadTime) / 1000;
  boolean sendRequest = isPrimaryCheckPointer
  || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
  doCheckpoint(sendRequest);
  ...
}{code}

 I agree to to move isPrimaryCheckPointer outside of 'if' block to avoid a 
inconsistent state that there are more than 1  SNN with isPrimaryCheckPointer = 
true, though it will not break anything.
 
 As to HDFS-12248, I think we may change sendRequest as following:

 
{code:java}
boolean sendRequest = needRollbackCheckpoint || isPrimaryCheckPointer
|| secsSinceLastUpload >= checkpointConf.getQuietPeriod();
{code}
Thus all SNN will send request everytime rollbackCheckpoint is triggered. Or we 
should fix the comments.

 

 


was (Author: starphin):
  Let's go back to original design doc [^Multiple-Standby-NameNodes_V1.pdf] 
SNN will get a HttpServletResponse.SC_CONFLICT status when ANN already download 
image file from other SNN.

So it is not a serious issue if all SNN send checkpoint request to ANN.

 Further more, SNN will send download image request as shown in comments 
above line 429 in 3 cases:
 * {color:#808080}rollback request{color}
 * {color:#808080}are the checkpointer{color}
 * {color:#808080} are outside the quiet period{color}

{color:#808080}{color:#33}But from the patch only in later two case will 
SNN send download request. I think it causes issue{color} HDFS-12248.{color}

 
{code:java}
if (needCheckpoint) {
  // on all nodes, we build the checkpoint. However, we only ship the 
checkpoint if have a
  // rollback request, are the checkpointer, are outside the quiet period.
  final long secsSinceLastUpload = (now - lastUploadTime) / 1000;
  boolean sendRequest = isPrimaryCheckPointer
  || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
  doCheckpoint(sendRequest);
  ...
}{code}
I agree to to move isPrimaryCheckPointer outside of 'if' block to avoid a 
inconsistent state that there are more than 1  SNN with isPrimaryCheckPointer = 
true, though it will not break anything.

As to {color:#808080}HDFS-12248,{color:#33} I think{color}{color} we may 
change sendRequest as following:
{code:java}
boolean sendRequest = needRollbackCheckpoint || isPrimaryCheckPointer
|| secsSinceLastUpload >= checkpointConf.getQuietPeriod();
{code}
Thus all SNN will send request everytime rollbackCheckpoint is triggered. Or we 
should fix the comments.

 

 

> SNN will always upload fsimage
> --
>
> Key: HDFS-14361
> URL: https://issues.apache.org/jira/browse/HDFS-14361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Priority: Major
> Fix For: 3.2.0
>
>
> Related to -HDFS-12248.-
> {code:java}
> boolean sendRequest = isPrimaryCheckPointer
> || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
> doCheckpoint(sendRequest);
> {code}
> If sendRequest is true, SNN will upload fsimage. But isPrimaryCheckPointer 
> always is true,
> {code:java}
> if (ie == null && ioe == null) {
>   //Update only when response from remote about success or
>   lastUploadTime = monotonicNow();
>   // we are primary if we successfully updated the ANN
>   this.isPrimaryCheckPointer = success;
> }
> {code}
> isPrimaryCheckPointer should be outside the if condition.
> If the ANN update was not successful, then isPrimaryCheckPointer should be 
> set to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hado

[jira] [Commented] (HDFS-14361) SNN will always upload fsimage

2019-03-14 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793295#comment-16793295
 ] 

star commented on HDFS-14361:
-

[~hunhun], _gotcha._ 

_If there is only one SNN,  it make sense for sendCheckpoint always being true._

_If more than one SNN,_ _sendCheckpoint_ _will be set false if another SNN send 
checkpoint request to ANN successfully. So I don't think it is a bug._ 

_But still I agree to move isPrimaryCheckPointer outside of 'if' block, to 
avoid a shortly inconsistent state that there are more than 1  SNN with 
isPrimaryCheckPointer = true._

> SNN will always upload fsimage
> --
>
> Key: HDFS-14361
> URL: https://issues.apache.org/jira/browse/HDFS-14361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Priority: Major
> Fix For: 3.2.0
>
>
> Related to -HDFS-12248.-
> {code:java}
> boolean sendRequest = isPrimaryCheckPointer
> || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
> doCheckpoint(sendRequest);
> {code}
> If sendRequest is true, SNN will upload fsimage. But isPrimaryCheckPointer 
> always is true,
> {code:java}
> if (ie == null && ioe == null) {
>   //Update only when response from remote about success or
>   lastUploadTime = monotonicNow();
>   // we are primary if we successfully updated the ANN
>   this.isPrimaryCheckPointer = success;
> }
> {code}
> isPrimaryCheckPointer should be outside the if condition.
> If the ANN update was not successful, then isPrimaryCheckPointer should be 
> set to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14361) SNN will always upload fsimage

2019-03-14 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793276#comment-16793276
 ] 

star commented on HDFS-14361:
-

[~hunhun], what do you mean by 'next code never work',

 if sendCheckpoint = true, the code will not return.
if(!sendCheckpoint){  return;
}
 

> SNN will always upload fsimage
> --
>
> Key: HDFS-14361
> URL: https://issues.apache.org/jira/browse/HDFS-14361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Priority: Major
> Fix For: 3.2.0
>
>
> Related to -HDFS-12248.-
> {code:java}
> boolean sendRequest = isPrimaryCheckPointer
> || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
> doCheckpoint(sendRequest);
> {code}
> If sendRequest is true, SNN will upload fsimage. But isPrimaryCheckPointer 
> always is true,
> {code:java}
> if (ie == null && ioe == null) {
>   //Update only when response from remote about success or
>   lastUploadTime = monotonicNow();
>   // we are primary if we successfully updated the ANN
>   this.isPrimaryCheckPointer = success;
> }
> {code}
> isPrimaryCheckPointer should be outside the if condition.
> If the ANN update was not successful, then isPrimaryCheckPointer should be 
> set to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14361) SNN will always upload fsimage

2019-03-14 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793261#comment-16793261
 ] 

star commented on HDFS-14361:
-

  Let's go back to original design doc [^Multiple-Standby-NameNodes_V1.pdf] 
SNN will get a HttpServletResponse.SC_CONFLICT status when ANN already download 
image file from other SNN.

So it is not a serious issue if all SNN send checkpoint request to ANN.

 Further more, SNN will send download image request as shown in comments 
above line 429 in 3 cases:
 * {color:#808080}rollback request{color}
 * {color:#808080}are the checkpointer{color}
 * {color:#808080} are outside the quiet period{color}

{color:#808080}{color:#33}But from the patch only in later two case will 
SNN send download request. I think it causes issue{color} HDFS-12248.{color}

 
{code:java}
if (needCheckpoint) {
  // on all nodes, we build the checkpoint. However, we only ship the 
checkpoint if have a
  // rollback request, are the checkpointer, are outside the quiet period.
  final long secsSinceLastUpload = (now - lastUploadTime) / 1000;
  boolean sendRequest = isPrimaryCheckPointer
  || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
  doCheckpoint(sendRequest);
  ...
}{code}
I agree to to move isPrimaryCheckPointer outside of 'if' block to avoid a 
inconsistent state that there are more than 1  SNN with isPrimaryCheckPointer = 
true, though it will not break anything.

As to {color:#808080}HDFS-12248,{color:#33} I think{color}{color} we may 
change sendRequest as following:
{code:java}
boolean sendRequest = needRollbackCheckpoint || isPrimaryCheckPointer
|| secsSinceLastUpload >= checkpointConf.getQuietPeriod();
{code}
Thus all SNN will send request everytime rollbackCheckpoint is triggered. Or we 
should fix the comments.

 

 

> SNN will always upload fsimage
> --
>
> Key: HDFS-14361
> URL: https://issues.apache.org/jira/browse/HDFS-14361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Priority: Major
> Fix For: 3.2.0
>
>
> Related to -HDFS-12248.-
> {code:java}
> boolean sendRequest = isPrimaryCheckPointer
> || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
> doCheckpoint(sendRequest);
> {code}
> If sendRequest is true, SNN will upload fsimage. But isPrimaryCheckPointer 
> always is true,
> {code:java}
> if (ie == null && ioe == null) {
>   //Update only when response from remote about success or
>   lastUploadTime = monotonicNow();
>   // we are primary if we successfully updated the ANN
>   this.isPrimaryCheckPointer = success;
> }
> {code}
> isPrimaryCheckPointer should be outside the if condition.
> If the ANN update was not successful, then isPrimaryCheckPointer should be 
> set to false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10659) Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos directory

2019-03-08 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16788469#comment-16788469
 ] 

star commented on HDFS-10659:
-

[~jojochuang],if is worth commiting, I would like to fix it to trunk.

> Namenode crashes after Journalnode re-installation in an HA cluster due to 
> missing paxos directory
> --
>
> Key: HDFS-10659
> URL: https://issues.apache.org/jira/browse/HDFS-10659
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, journal-node
>Affects Versions: 2.7.0
>Reporter: Amit Anand
>Assignee: Hanisha Koneru
>Priority: Major
> Attachments: HDFS-10659.000.patch, HDFS-10659.001.patch, 
> HDFS-10659.002.patch, HDFS-10659.003.patch, HDFS-10659.004.patch
>
>
> In my environment I am seeing {{Namenodes}} crashing down after majority of 
> {{Journalnodes}} are re-installed. We manage multiple clusters and do rolling 
> upgrades followed by rolling re-install of each node including master(NN, JN, 
> RM, ZK) nodes. When a journal node is re-installed or moved to a new 
> disk/host, instead of running {{"initializeSharedEdits"}} command, I copy 
> {{VERSION}} file from one of the other {{Journalnode}} and that allows my 
> {{NN}} to start writing data to the newly installed {{Journalnode}}.
> To acheive quorum for JN and recover unfinalized segments NN during starupt 
> creates .tmp files under {{"/jn/current/paxos"}} directory . In 
> current implementation "paxos" directry is only created during 
> {{"initializeSharedEdits"}} command and if a JN is re-installed the "paxos" 
> directory is not created upon JN startup or by NN while writing .tmp 
> files which causes NN to crash with following error message:
> {code}
> 192.168.100.16:8485: /disk/1/dfs/jn/Test-Laptop/current/paxos/64044.tmp (No 
> such file or directory)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:221)
> at java.io.FileOutputStream.(FileOutputStream.java:171)
> at 
> org.apache.hadoop.hdfs.util.AtomicFileOutputStream.(AtomicFileOutputStream.java:58)
> at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.persistPaxosData(Journal.java:971)
> at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:846)
> at 
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:205)
> at 
> org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:249)
> at 
> org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
> {code}
> The current 
> [getPaxosFile|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java#L128-L130]
>  method simply returns a path to a file under "paxos" directory without 
> verifiying its existence. Since "paxos" directoy holds files that are 
> required for NN recovery and acheiving JN quorum my proposed solution is to 
> add a check to "getPaxosFile" method and create the {{"paxos"}} directory if 
> it is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10659) Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos directory

2019-03-08 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787947#comment-16787947
 ] 

star commented on HDFS-10659:
-

+1. Recently, our hadoop namenode shutdown when switching active namenode, just 
because of missing paxos directory. It is created in the default /tmp path and 
deleted by os for no operation in 7 days. We can avoid this by moving journal 
directory to a none tmp dir, but to make sure namenode works well by a default 
config will be better.

> Namenode crashes after Journalnode re-installation in an HA cluster due to 
> missing paxos directory
> --
>
> Key: HDFS-10659
> URL: https://issues.apache.org/jira/browse/HDFS-10659
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, journal-node
>Affects Versions: 2.7.0
>Reporter: Amit Anand
>Assignee: Hanisha Koneru
>Priority: Major
> Attachments: HDFS-10659.000.patch, HDFS-10659.001.patch, 
> HDFS-10659.002.patch, HDFS-10659.003.patch, HDFS-10659.004.patch
>
>
> In my environment I am seeing {{Namenodes}} crashing down after majority of 
> {{Journalnodes}} are re-installed. We manage multiple clusters and do rolling 
> upgrades followed by rolling re-install of each node including master(NN, JN, 
> RM, ZK) nodes. When a journal node is re-installed or moved to a new 
> disk/host, instead of running {{"initializeSharedEdits"}} command, I copy 
> {{VERSION}} file from one of the other {{Journalnode}} and that allows my 
> {{NN}} to start writing data to the newly installed {{Journalnode}}.
> To acheive quorum for JN and recover unfinalized segments NN during starupt 
> creates .tmp files under {{"/jn/current/paxos"}} directory . In 
> current implementation "paxos" directry is only created during 
> {{"initializeSharedEdits"}} command and if a JN is re-installed the "paxos" 
> directory is not created upon JN startup or by NN while writing .tmp 
> files which causes NN to crash with following error message:
> {code}
> 192.168.100.16:8485: /disk/1/dfs/jn/Test-Laptop/current/paxos/64044.tmp (No 
> such file or directory)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:221)
> at java.io.FileOutputStream.(FileOutputStream.java:171)
> at 
> org.apache.hadoop.hdfs.util.AtomicFileOutputStream.(AtomicFileOutputStream.java:58)
> at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.persistPaxosData(Journal.java:971)
> at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:846)
> at 
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:205)
> at 
> org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:249)
> at 
> org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
> {code}
> The current 
> [getPaxosFile|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java#L128-L130]
>  method simply returns a path to a file under "paxos" directory without 
> verifiying its existence. Since "paxos" directoy holds files that are 
> required for NN recovery and acheiving JN quorum my proposed solution is to 
> add a check to "getPaxosFile" method and create the {{"paxos"}} directory if 
> it is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14340) Lower the log level when can't get postOpAttr

2019-03-07 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786758#comment-16786758
 ] 

star commented on HDFS-14340:
-

# switch to branch 'trunk'
 # edit your code and execute cmd 'git diff > ../HDFS-14340.trunk.patch'

> Lower the log level when can't get postOpAttr
> -
>
> Key: HDFS-14340
> URL: https://issues.apache.org/jira/browse/HDFS-14340
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: nfs
>Reporter: Anuhan Torgonshar
>Priority: Major
>  Labels: easyfix
> Attachments: HDFS-14340.master.001.patch
>
>
> I think should lower the log level when can't get postOpAttr in 
> _*hadoop-2.8.5-src/hadoop-hdfs-project/hadoop-hdfs-nfs/src/main/java/org/apache/hadoop/hdfs/nfs/nfs3/**RpcProgramNfs3.java*_.
>  
>  
> {code:java}
> **The fisrt code snippet***
> //the problematic log level ERROR, at line 1044
> try {
>dirWcc = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(preOpDirAttr),
>dfsClient, dirFileIdPath, iug);
> } catch (IOException e1) {
>LOG.error("Can't get postOpDirAttr for dirFileId: "
>+ dirHandle.getFileId(), e1);
> }
> **The second code snippet***
> //other practice in similar code snippets, line number is 475, the log 
> assigned with INFO level
> try { 
>wccData = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(preOpAttr), 
> dfsClient,   fileIdPath, iug); 
> } catch (IOException e1) { 
>LOG.info("Can't get postOpAttr for fileIdPath: " + fileIdPath, e1); 
> }
> **The third code snippet***
> //other practice in similar code snippets, line number is 1405, the log 
> assigned with INFO level
> try {
>fromDirWcc = Nfs3Utils.createWccData(
>Nfs3Utils.getWccAttr(fromPreOpAttr), dfsClient, fromDirFileIdPath,iug);
>toDirWcc = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(toPreOpAttr),
>dfsClient, toDirFileIdPath, iug);
> } catch (IOException e1) {
>LOG.info("Can't get postOpDirAttr for " + fromDirFileIdPath + " or"
>+ toDirFileIdPath, e1);
> }
> {code}
> Therefore, I think the logging practices should be consistent in similar 
> contexts. When the code catches _*IOException*_ for *_getWccAttr()_* method, 
> it more likely prints a log message with _*INFO*_ level, a lower level.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14340) Lower the log level when can't get postOpAttr

2019-03-06 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786337#comment-16786337
 ] 

star commented on HDFS-14340:
-

[~OneisAll], you can make a patch for the issue.

> Lower the log level when can't get postOpAttr
> -
>
> Key: HDFS-14340
> URL: https://issues.apache.org/jira/browse/HDFS-14340
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: nfs
>Affects Versions: 3.1.0, 2.8.5
>Reporter: Anuhan Torgonshar
>Priority: Major
> Attachments: RpcProgramNfs3.java
>
>
> I think should lower the log level when can't get postOpAttr in 
> _*hadoop-2.8.5-src/hadoop-hdfs-project/hadoop-hdfs-nfs/src/main/java/org/apache/hadoop/hdfs/nfs/nfs3/**RpcProgramNfs3.java*_.
>  
>  
> {code:java}
> **The fisrt code snippet***
> //the problematic log level ERROR, at line 1044
> try {
>dirWcc = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(preOpDirAttr),
>dfsClient, dirFileIdPath, iug);
> } catch (IOException e1) {
>LOG.error("Can't get postOpDirAttr for dirFileId: "
>+ dirHandle.getFileId(), e1);
> }
> **The second code snippet***
> //other practice in similar code snippets, line number is 475, the log 
> assigned with INFO level
> try { 
>wccData = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(preOpAttr), 
> dfsClient,   fileIdPath, iug); 
> } catch (IOException e1) { 
>LOG.info("Can't get postOpAttr for fileIdPath: " + fileIdPath, e1); 
> }
> **The third code snippet***
> //other practice in similar code snippets, line number is 1405, the log 
> assigned with INFO level
> try {
>fromDirWcc = Nfs3Utils.createWccData(
>Nfs3Utils.getWccAttr(fromPreOpAttr), dfsClient, fromDirFileIdPath,iug);
>toDirWcc = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(toPreOpAttr),
>dfsClient, toDirFileIdPath, iug);
> } catch (IOException e1) {
>LOG.info("Can't get postOpDirAttr for " + fromDirFileIdPath + " or"
>+ toDirFileIdPath, e1);
> }
> {code}
> Therefore, I think the logging practices should be consistent in similar 
> contexts. When the code catches _*IOException*_ for *_getWccAttr()_* method, 
> it more likely prints a log message with _*INFO*_ level, a lower level.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14340) Lower the log level when can't get postOpAttr

2019-03-06 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786326#comment-16786326
 ] 

star commented on HDFS-14340:
-

Checking more code, indeed there are inconsistency in those code——write() with 
info log level, create with error log info, while others with warn and info log 
level. 

> Lower the log level when can't get postOpAttr
> -
>
> Key: HDFS-14340
> URL: https://issues.apache.org/jira/browse/HDFS-14340
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: nfs
>Affects Versions: 3.1.0, 2.8.5
>Reporter: Anuhan Torgonshar
>Priority: Major
> Attachments: RpcProgramNfs3.java
>
>
> I think should lower the log level when can't get postOpAttr in 
> _*hadoop-2.8.5-src/hadoop-hdfs-project/hadoop-hdfs-nfs/src/main/java/org/apache/hadoop/hdfs/nfs/nfs3/**RpcProgramNfs3.java*_.
>  
>  
> {code:java}
> **The fisrt code snippet***
> //the problematic log level ERROR, at line 1044
> try {
>dirWcc = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(preOpDirAttr),
>dfsClient, dirFileIdPath, iug);
> } catch (IOException e1) {
>LOG.error("Can't get postOpDirAttr for dirFileId: "
>+ dirHandle.getFileId(), e1);
> }
> **The second code snippet***
> //other practice in similar code snippets, line number is 475, the log 
> assigned with INFO level
> try { 
>wccData = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(preOpAttr), 
> dfsClient,   fileIdPath, iug); 
> } catch (IOException e1) { 
>LOG.info("Can't get postOpAttr for fileIdPath: " + fileIdPath, e1); 
> }
> **The third code snippet***
> //other practice in similar code snippets, line number is 1405, the log 
> assigned with INFO level
> try {
>fromDirWcc = Nfs3Utils.createWccData(
>Nfs3Utils.getWccAttr(fromPreOpAttr), dfsClient, fromDirFileIdPath,iug);
>toDirWcc = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(toPreOpAttr),
>dfsClient, toDirFileIdPath, iug);
> } catch (IOException e1) {
>LOG.info("Can't get postOpDirAttr for " + fromDirFileIdPath + " or"
>+ toDirFileIdPath, e1);
> }
> {code}
> Therefore, I think the logging practices should be consistent in similar 
> contexts. When the code catches _*IOException*_ for *_getWccAttr()_* method, 
> it more likely prints a log message with _*INFO*_ level, a lower level.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14340) Lower the log level when can't get postOpAttr

2019-03-06 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786286#comment-16786286
 ] 

star commented on HDFS-14340:
-

According to the original code, I think:
 # There's a warn log before that exception.
 # The code still work well even with exception thrown out. 

{code:java}
//代码占位符
} catch (IOException e) {
  LOG.warn("Exception ", e);
  WccData wccData = null;
  try {
wccData = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(preOpAttr),
dfsClient, fileIdPath, iug);
  } catch (IOException e1) {
LOG.info("Can't get postOpAttr for fileIdPath: " + fileIdPath, e1);
  }

  int status = mapErrorStatus(e);
  return new SETATTR3Response(status, wccData);
}
{code}

> Lower the log level when can't get postOpAttr
> -
>
> Key: HDFS-14340
> URL: https://issues.apache.org/jira/browse/HDFS-14340
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: nfs
>Affects Versions: 3.1.0, 2.8.5
>Reporter: Anuhan Torgonshar
>Priority: Major
> Attachments: RpcProgramNfs3.java
>
>
> I think should lower the log level when can't get postOpAttr in 
> _*hadoop-2.8.5-src/hadoop-hdfs-project/hadoop-hdfs-nfs/src/main/java/org/apache/hadoop/hdfs/nfs/nfs3/**RpcProgramNfs3.java*_.
>  
>  
> {code:java}
> //the problematic log level ERROR, at line 1044
> try {
>dirWcc = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(preOpDirAttr),
>dfsClient, dirFileIdPath, iug);
> } catch (IOException e1) {
>LOG.error("Can't get postOpDirAttr for dirFileId: "
>+ dirHandle.getFileId(), e1);
> }
> //other practice in similar code snippets, line number is 475, the log 
> assigned with INFO level
> try { 
>wccData = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(preOpAttr), 
> dfsClient,   fileIdPath, iug); 
> } catch (IOException e1) { 
>LOG.info("Can't get postOpAttr for fileIdPath: " + fileIdPath, e1); 
> }
> //other practice in similar code snippets, line number is 1405, the log 
> assigned with INFO level
> try {
>fromDirWcc = Nfs3Utils.createWccData(
>Nfs3Utils.getWccAttr(fromPreOpAttr), dfsClient, fromDirFileIdPath,iug);
>toDirWcc = Nfs3Utils.createWccData(Nfs3Utils.getWccAttr(toPreOpAttr),
>dfsClient, toDirFileIdPath, iug);
> } catch (IOException e1) {
>LOG.info("Can't get postOpDirAttr for " + fromDirFileIdPath + " or"
>+ toDirFileIdPath, e1);
> }
> {code}
> Therefore, I think the logging practices should be consistent in similar 
> contexts. When the code catches _*IOException*_ for *_getWccAttr()_* method, 
> it more likely prints a log message with _*INFO*_ level, a lower level.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14317) Standby does not trigger edit log rolling when in-progress edit log tailing is enabled

2019-03-05 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785112#comment-16785112
 ] 

star commented on HDFS-14317:
-

thanks [~ekanth].

> Standby does not trigger edit log rolling when in-progress edit log tailing 
> is enabled
> --
>
> Key: HDFS-14317
> URL: https://issues.apache.org/jira/browse/HDFS-14317
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Ekanth Sethuramalingam
>Assignee: Ekanth Sethuramalingam
>Priority: Critical
> Attachments: HDFS-14317.001.patch, HDFS-14317.002.patch, 
> HDFS-14317.003.patch, HDFS-14317.004.patch
>
>
> The standby uses the following method to check if it is time to trigger edit 
> log rolling on active.
> {code}
>   /**
>* @return true if the configured log roll period has elapsed.
>*/
>   private boolean tooLongSinceLastLoad() {
> return logRollPeriodMs >= 0 && 
>   (monotonicNow() - lastLoadTimeMs) > logRollPeriodMs ;
>   }
> {code}
> In doTailEdits(), lastLoadTimeMs is updated when standby is able to 
> successfully tail any edits
> {code}
>   if (editsLoaded > 0) {
> lastLoadTimeMs = monotonicNow();
>   }
> {code}
> The default configuration for {{dfs.ha.log-roll.period}} is 120 seconds and 
> {{dfs.ha.tail-edits.period}} is 60 seconds. With in-progress edit log tailing 
> enabled, tooLongSinceLastLoad() will almost never return true resulting in 
> edit logs not rolled for a long time until this configuration 
> {{dfs.namenode.edit.log.autoroll.multiplier.threshold}} takes effect.
> [In our deployment, this resulted in in-progress edit logs getting deleted. 
> The sequence of events is that standby was able to checkpoint twice while the 
> in-progress edit log was growing on active. When the 
> NNStorageRetentionManager decided to cleanup old checkpoints and edit logs, 
> it cleaned up the in-progress edit log from active and QJM (as the txnid on 
> in-progress edit log was older than the 2 most recent checkpoints) resulting 
> in irrecoverably losing a few minutes worth of metadata].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14317) Standby does not trigger edit log rolling when in-progress edit log tailing is enabled

2019-03-04 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784017#comment-16784017
 ] 

star commented on HDFS-14317:
-

[~ekanth], nice patch.

I am not very clear about 'With in-progress edit log tailing enabled, 
tooLongSinceLastLoad() will almost never return true resulting in edit logs not 
rolled '. Can you describe it deeply? 

> Standby does not trigger edit log rolling when in-progress edit log tailing 
> is enabled
> --
>
> Key: HDFS-14317
> URL: https://issues.apache.org/jira/browse/HDFS-14317
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Ekanth Sethuramalingam
>Assignee: Ekanth Sethuramalingam
>Priority: Critical
> Attachments: HDFS-14317.001.patch, HDFS-14317.002.patch, 
> HDFS-14317.003.patch, HDFS-14317.004.patch
>
>
> The standby uses the following method to check if it is time to trigger edit 
> log rolling on active.
> {code}
>   /**
>* @return true if the configured log roll period has elapsed.
>*/
>   private boolean tooLongSinceLastLoad() {
> return logRollPeriodMs >= 0 && 
>   (monotonicNow() - lastLoadTimeMs) > logRollPeriodMs ;
>   }
> {code}
> In doTailEdits(), lastLoadTimeMs is updated when standby is able to 
> successfully tail any edits
> {code}
>   if (editsLoaded > 0) {
> lastLoadTimeMs = monotonicNow();
>   }
> {code}
> The default configuration for {{dfs.ha.log-roll.period}} is 120 seconds and 
> {{dfs.ha.tail-edits.period}} is 60 seconds. With in-progress edit log tailing 
> enabled, tooLongSinceLastLoad() will almost never return true resulting in 
> edit logs not rolled for a long time until this configuration 
> {{dfs.namenode.edit.log.autoroll.multiplier.threshold}} takes effect.
> [In our deployment, this resulted in in-progress edit logs getting deleted. 
> The sequence of events is that standby was able to checkpoint twice while the 
> in-progress edit log was growing on active. When the 
> NNStorageRetentionManager decided to cleanup old checkpoints and edit logs, 
> it cleaned up the in-progress edit log from active and QJM (as the txnid on 
> in-progress edit log was older than the 2 most recent checkpoints) resulting 
> in irrecoverably losing a few minutes worth of metadata].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-03-04 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314-branch-2.8.001.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Assignee: star
>Priority: Critical
> Fix For: 2.10.0, 3.0.4, 3.3.0, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14314-branch-2.8.001.patch, 
> HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314-trunk.003.patch, 
> HDFS-14314-trunk.004.patch, HDFS-14314-trunk.005.patch, 
> HDFS-14314-trunk.006.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.branch-2.001.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-03-04 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783942#comment-16783942
 ] 

star commented on HDFS-14314:
-

[~jojochuang], append to your sammerization.

     2. Afterwards, DN reset its local lease id to 0 {color:#59afe1}right after 
reregistering to new NN instance.{color}

 

{color:#33}As to branch-2.8, should I submit a new patch named like 
HDFS-14314-branch-2.8.001.patch?{color}

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Assignee: star
>Priority: Critical
> Fix For: 2.10.0, 3.0.4, 3.3.0, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314-trunk.003.patch, 
> HDFS-14314-trunk.004.patch, HDFS-14314-trunk.005.patch, 
> HDFS-14314-trunk.006.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.branch-2.001.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

--

[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-03-04 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783256#comment-16783256
 ] 

star commented on HDFS-14314:
-

[~jojochuang], could you help review the new patch?

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Assignee: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314-trunk.003.patch, 
> HDFS-14314-trunk.004.patch, HDFS-14314-trunk.005.patch, 
> HDFS-14314-trunk.006.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-03-01 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782284#comment-16782284
 ] 

star commented on HDFS-14314:
-

[~jojochuang], thanks for pointing out the mistake. new patch uploaded. Now 
test will fail without the fix. 

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Assignee: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314-trunk.003.patch, 
> HDFS-14314-trunk.004.patch, HDFS-14314-trunk.005.patch, 
> HDFS-14314-trunk.006.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-03-01 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314-trunk.006.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Assignee: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314-trunk.003.patch, 
> HDFS-14314-trunk.004.patch, HDFS-14314-trunk.005.patch, 
> HDFS-14314-trunk.006.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-28 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314-trunk.005.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314-trunk.003.patch, 
> HDFS-14314-trunk.004.patch, HDFS-14314-trunk.005.patch, HDFS-14314.0.patch, 
> HDFS-14314.2.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-27 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780153#comment-16780153
 ] 

star commented on HDFS-14314:
-

checkstyle seems not consistent. 

Same code ".thenAnswer" at line 1000, 1002, 1003 has 

indentation 6, which passes checking, While code at 1010 and 1024 throws error.

How to fix such error?

 

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314-trunk.003.patch, 
> HDFS-14314-trunk.004.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-27 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314-trunk.004.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314-trunk.003.patch, 
> HDFS-14314-trunk.004.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-27 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314-trunk.003.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314-trunk.003.patch, HDFS-14314.0.patch, 
> HDFS-14314.2.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-27 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779096#comment-16779096
 ] 

star commented on HDFS-14314:
-

ok, I've run all failed test at local and passed. Code style will be fixed soon.

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-26 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778907#comment-16778907
 ] 

star commented on HDFS-14314:
-

Still unexpected error !What's next?

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-26 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778122#comment-16778122
 ] 

star commented on HDFS-14314:
-

[~hexiaoqiao], thanks. new patch updated.

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-26 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314-trunk.002.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-26 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314-trunk.001.patch
Status: Patch Available  (was: Open)

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314.0.patch, HDFS-14314.2.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-26 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777692#comment-16777692
 ] 

star commented on HDFS-14314:
-

fix code style and mvn checkstyle passed

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314.0.patch, 
> HDFS-14314.2.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-25 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314-trunk.001.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314-trunk.001.patch, HDFS-14314.0.patch, 
> HDFS-14314.2.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-25 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314.2.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.0.patch, HDFS-14314.2.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-25 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: (was: HDFS-14314.1.patch)

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.0.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-25 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776898#comment-16776898
 ] 

star commented on HDFS-14314:
-

unit test added.

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.0.patch, HDFS-14314.1.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-25 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314.1.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.0.patch, HDFS-14314.1.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-24 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776241#comment-16776241
 ] 

star commented on HDFS-14314:
-

[~hexiaoqiao] Thanks. I will add unit test later.

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.0.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-24 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776234#comment-16776234
 ] 

star commented on HDFS-14314:
-

I am confused that tests of TestJournalNodeSync Failed since it has nothing to 
do with my patch. I also test TestJournalNodeSync locally and all test case 
passed. Any one could help?

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.0.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-24 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776151#comment-16776151
 ] 

star commented on HDFS-14314:
-

Reuploaded the patch with file format changed.

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.0.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-24 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Status: Open  (was: Patch Available)

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.0.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-23 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314.0.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.0.patch, HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-23 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Attachment: HDFS-14314.patch

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-23 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775981#comment-16775981
 ] 

star commented on HDFS-14314:
-

[~jojochuang] patch available now. 

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.patch
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-23 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Fix Version/s: 2.8.4
Affects Version/s: (was: 2.8.0)
   2.8.4
   Status: Patch Available  (was: Open)

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.4
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
> Fix For: 2.8.4
>
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after registering to NN

2019-02-23 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Summary: fullBlockReportLeaseId should be reset after registering to NN  
(was: fullBlockReportLeaseId should be reset after )

> fullBlockReportLeaseId should be reset after registering to NN
> --
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.0
> Environment:  
>  
>  
>Reporter: star
>Priority: Critical
>
>   since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
> block lease id from active NN before sending full block to NN. Then DN will 
> send full block report together with lease id. If the lease id is invalid, NN 
> will reject the full block report and log "not in the pending set".
>   In a case when DN is doing full block reporting while NN is restarted. 
> It happens that DN will later send a full block report with lease id 
> ,acquired from previous NN instance, which is invalid to the new NN instance. 
> Though DN recognized the new NN instance by heartbeat and reregister itself, 
> it did not reset the lease id from previous instance.
>   The issuse may cause DNs to temporarily go dead, making it unsafe to 
> restart NN especially in hadoop cluster which has large amount of DNs. 
> HDFS-12914 reported the issue  without any clues why it occurred and remain 
> unsolved.
>    To make it clear, look at code below. We take it from method 
> offerService of class BPServiceActor. We eliminate some code to focus on 
> current issue. fullBlockReportLeaseId is a local variable to hold lease id 
> from NN. Exceptions will occur at blockReport call when NN restarting, which 
> will be caught by catch block in while loop. Thus fullBlockReportLeaseId will 
> not be set to 0. After NN restarted, DN will send full block report which 
> will be rejected by the new NN instance. DN will never send full block report 
> until the next full block report schedule, about an hour later.
>   Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to NN. Thus it will ask for a valid 
> fullBlockReportLeaseId from new NN instance.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time
>   //
>   while (shouldRun()) {
> try {
>   final long startTime = scheduler.monotonicNow();
>   //
>   // Every so often, send heartbeat or block-report
>   //
>   final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>   HeartbeatResponse resp = null;
>   if (sendHeartbeat) {
>   
> boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
> scheduler.isBlockReportDue(startTime);
> scheduler.scheduleNextHeartbeat();
> if (!dn.areHeartbeatsDisabledForTests()) {
>   resp = sendHeartBeat(requestBlockReportLease);
>   assert resp != null;
>   if (resp.getFullBlockReportLeaseId() != 0) {
> if (fullBlockReportLeaseId != 0) {
>   LOG.warn(nnAddr + " sent back a full block report lease " +
>   "ID of 0x" +
>   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>   ", but we already have a lease ID of 0x" +
>   Long.toHexString(fullBlockReportLeaseId) + ". " +
>   "Overwriting old lease ID.");
> }
> fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>   }
>  
> }
>   }
>
>  
>   if ((fullBlockReportLeaseId != 0) || forceFullBr) {
> //Exception occurred here when NN restarting
> cmds = blockReport(fullBlockReportLeaseId);
> fullBlockReportLeaseId = 0;
>   }
>   
> } catch(RemoteException re) {
>   
>   } // while (shouldRun())
> } // offerService{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after

2019-02-22 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Description: 
  since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
block lease id from active NN before sending full block to NN. Then DN will 
send full block report together with lease id. If the lease id is invalid, NN 
will reject the full block report and log "not in the pending set".

  In a case when DN is doing full block reporting while NN is restarted. It 
happens that DN will later send a full block report with lease id ,acquired 
from previous NN instance, which is invalid to the new NN instance. Though DN 
recognized the new NN instance by heartbeat and reregister itself, it did not 
reset the lease id from previous instance.

  The issuse may cause DNs to temporarily go dead, making it unsafe to 
restart NN especially in hadoop cluster which has large amount of DNs. 
HDFS-12914 reported the issue  without any clues why it occurred and remain 
unsolved.

   To make it clear, look at code below. We take it from method 
offerService of class BPServiceActor. We eliminate some code to focus on 
current issue. fullBlockReportLeaseId is a local variable to hold lease id from 
NN. Exceptions will occur at blockReport call when NN restarting, which will be 
caught by catch block in while loop. Thus fullBlockReportLeaseId will not be 
set to 0. After NN restarted, DN will send full block report which will be 
rejected by the new NN instance. DN will never send full block report until the 
next full block report schedule, about an hour later.

  Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
exception or after registering to NN. Thus it will ask for a valid 
fullBlockReportLeaseId from new NN instance.
{code:java}
private void offerService() throws Exception {

  long fullBlockReportLeaseId = 0;

  //
  // Now loop for a long time
  //
  while (shouldRun()) {
try {
  final long startTime = scheduler.monotonicNow();

  //
  // Every so often, send heartbeat or block-report
  //
  final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
  HeartbeatResponse resp = null;
  if (sendHeartbeat) {
  
boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
scheduler.isBlockReportDue(startTime);
scheduler.scheduleNextHeartbeat();
if (!dn.areHeartbeatsDisabledForTests()) {
  resp = sendHeartBeat(requestBlockReportLease);
  assert resp != null;
  if (resp.getFullBlockReportLeaseId() != 0) {
if (fullBlockReportLeaseId != 0) {
  LOG.warn(nnAddr + " sent back a full block report lease " +
  "ID of 0x" +
  Long.toHexString(resp.getFullBlockReportLeaseId()) +
  ", but we already have a lease ID of 0x" +
  Long.toHexString(fullBlockReportLeaseId) + ". " +
  "Overwriting old lease ID.");
}
fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
  }
 
}
  }
   
 
  if ((fullBlockReportLeaseId != 0) || forceFullBr) {
//Exception occurred here when NN restarting
cmds = blockReport(fullBlockReportLeaseId);
fullBlockReportLeaseId = 0;
  }
  
} catch(RemoteException re) {
  
  } // while (shouldRun())
} // offerService{code}
 

  was:
  since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
block lease id from active NN before sending full block to NN. Then DN will 
send full block report together with lease id. If the lease id is invalid, NN 
will reject the full block report and log "not in the pending set".

  In a case when DN is doing full block reporting while NN is restarted. It 
happens that DN will later send a full block report with lease id ,acquired 
from previous NN instance, which is invalid to the new NN instance. Though DN 
recognized the new NN instance by heartbeat and reregister itself, it did not 
reset the lease id from previous instance.

  The issuse may cause DNs to temporarily go dead, making it unsafe to 
restart NN especially in hadoop cluster which has large amount of DNs. 
HDFS-12914 reported the issue  without any clues why it occurred and remain 
unsolved.

   To make it clear, look at code below. We take it from method 
offerService of class BPServiceActor. We eliminate some code to focus on 
current issue. fullBlockReportLeaseId is a local variable. Exceptions will 
occur at blockReport call when NN restarting, which will be caught by catch 
block in while loop. Thus fullBlockReportLeaseId will not be set to 0. After NN 
restarted, DN will send full block report which will be rejected by the new NN 
instance. DN will never send full block report until the next full block repo

[jira] [Updated] (HDFS-14314) fullBlockReportLeaseId should be reset after

2019-02-22 Thread star (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:

Description: 
  since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
block lease id from active NN before sending full block to NN. Then DN will 
send full block report together with lease id. If the lease id is invalid, NN 
will reject the full block report and log "not in the pending set".

  In a case when DN is doing full block reporting while NN is restarted. It 
happens that DN will later send a full block report with lease id ,acquired 
from previous NN instance, which is invalid to the new NN instance. Though DN 
recognized the new NN instance by heartbeat and reregister itself, it did not 
reset the lease id from previous instance.

  The issuse may cause DNs to temporarily go dead, making it unsafe to 
restart NN especially in hadoop cluster which has large amount of DNs. 
HDFS-12914 reported the issue  without any clues why it occurred and remain 
unsolved.

   To make it clear, look at code below. We take it from method 
offerService of class BPServiceActor. We eliminate some code to focus on 
current issue. fullBlockReportLeaseId is a local variable. Exceptions will 
occur at blockReport call when NN restarting, which will be caught by catch 
block in while loop. Thus fullBlockReportLeaseId will not be set to 0. After NN 
restarted, DN will send full block report which will be rejected by the new NN 
instance. DN will never send full block report until the next full block report 
schedule, about an hour later.

  Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
exception or after registering to NN. Thus it will ask for a valid 
fullBlockReportLeaseId from new NN instance.
{code:java}
private void offerService() throws Exception {

  long fullBlockReportLeaseId = 0;

  //
  // Now loop for a long time
  //
  while (shouldRun()) {
try {
  final long startTime = scheduler.monotonicNow();

  //
  // Every so often, send heartbeat or block-report
  //
  final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
  HeartbeatResponse resp = null;
  if (sendHeartbeat) {
  
boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
scheduler.isBlockReportDue(startTime);
scheduler.scheduleNextHeartbeat();
if (!dn.areHeartbeatsDisabledForTests()) {
  resp = sendHeartBeat(requestBlockReportLease);
  assert resp != null;
  if (resp.getFullBlockReportLeaseId() != 0) {
if (fullBlockReportLeaseId != 0) {
  LOG.warn(nnAddr + " sent back a full block report lease " +
  "ID of 0x" +
  Long.toHexString(resp.getFullBlockReportLeaseId()) +
  ", but we already have a lease ID of 0x" +
  Long.toHexString(fullBlockReportLeaseId) + ". " +
  "Overwriting old lease ID.");
}
fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
  }
 
}
  }
   
 
  if ((fullBlockReportLeaseId != 0) || forceFullBr) {
//Exception occurred here when NN restarting
cmds = blockReport(fullBlockReportLeaseId);
fullBlockReportLeaseId = 0;
  }
  
} catch(RemoteException re) {
  
  } // while (shouldRun())
} // offerService{code}
 

  was:
  since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
block lease id from active NN before sending full block to NN. Then DN will 
send full block report together with lease id. If the lease id is invalid, NN 
will reject the full block report and log "not in the pending set".

  In a case when DN is doing full block reporting while NN is restarted. It 
happens that DN will later send a full block report with lease id ,acquired 
from previous NN instance, which is invalid to the new NN instance. Though DN 
recognized the new NN instance by heartbeat and reregister itself, it did not 
reset the lease id from previous instance.

  The issuse may cause DNs to temporarily go dead, making it unsafe to 
restart NN especially in hadoop cluster which has large amount of DNs.

   To make it clear, look at code below. We take it from method 
offerService of class BPServiceActor. We eliminate some code to focus on 
current issue. fullBlockReportLeaseId is a local variable. Exceptions will 
occur at blockReport call when NN restarting, which will be caught by catch 
block in while loop. Thus fullBlockReportLeaseId will not be set to 0. After NN 
restarted, DN will send full block report which will be rejected by the new NN 
instance. DN will never send full block report until the next full block report 
schedule, about an hour later. 

  Solution is simple, just reset fullBlockReportLeaseId to 0 after any 

[jira] [Created] (HDFS-14314) fullBlockReportLeaseId should be reset after

2019-02-22 Thread star (JIRA)
star created HDFS-14314:
---

 Summary: fullBlockReportLeaseId should be reset after 
 Key: HDFS-14314
 URL: https://issues.apache.org/jira/browse/HDFS-14314
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.8.0
 Environment:  

 

 
Reporter: star


  since HDFS-7923 ,to rate-limit DN block report, DN will ask for a full 
block lease id from active NN before sending full block to NN. Then DN will 
send full block report together with lease id. If the lease id is invalid, NN 
will reject the full block report and log "not in the pending set".

  In a case when DN is doing full block reporting while NN is restarted. It 
happens that DN will later send a full block report with lease id ,acquired 
from previous NN instance, which is invalid to the new NN instance. Though DN 
recognized the new NN instance by heartbeat and reregister itself, it did not 
reset the lease id from previous instance.

  The issuse may cause DNs to temporarily go dead, making it unsafe to 
restart NN especially in hadoop cluster which has large amount of DNs.

   To make it clear, look at code below. We take it from method 
offerService of class BPServiceActor. We eliminate some code to focus on 
current issue. fullBlockReportLeaseId is a local variable. Exceptions will 
occur at blockReport call when NN restarting, which will be caught by catch 
block in while loop. Thus fullBlockReportLeaseId will not be set to 0. After NN 
restarted, DN will send full block report which will be rejected by the new NN 
instance. DN will never send full block report until the next full block report 
schedule, about an hour later. 

  Solution is simple, just reset fullBlockReportLeaseId to 0 after any 
exception or after registering to NN. Thus it will ask for a valid 
fullBlockReportLeaseId from new NN instance.
{code:java}
private void offerService() throws Exception {

  long fullBlockReportLeaseId = 0;

  //
  // Now loop for a long time
  //
  while (shouldRun()) {
try {
  final long startTime = scheduler.monotonicNow();

  //
  // Every so often, send heartbeat or block-report
  //
  final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
  HeartbeatResponse resp = null;
  if (sendHeartbeat) {
  
boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
scheduler.isBlockReportDue(startTime);
scheduler.scheduleNextHeartbeat();
if (!dn.areHeartbeatsDisabledForTests()) {
  resp = sendHeartBeat(requestBlockReportLease);
  assert resp != null;
  if (resp.getFullBlockReportLeaseId() != 0) {
if (fullBlockReportLeaseId != 0) {
  LOG.warn(nnAddr + " sent back a full block report lease " +
  "ID of 0x" +
  Long.toHexString(resp.getFullBlockReportLeaseId()) +
  ", but we already have a lease ID of 0x" +
  Long.toHexString(fullBlockReportLeaseId) + ". " +
  "Overwriting old lease ID.");
}
fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
  }
 
}
  }
   
 
  if ((fullBlockReportLeaseId != 0) || forceFullBr) {
//Exception occurred here when NN restarting
cmds = blockReport(fullBlockReportLeaseId);
fullBlockReportLeaseId = 0;
  }
  
} catch(RemoteException re) {
  
  } // while (shouldRun())
} // offerService{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



<    1   2