[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-13 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973256#comment-16973256
 ] 

Yiqun Lin commented on HDFS-14648:
--

LGTM, +1.

I will hold off the commit until the weekend in case other guys has other 
review comments, :).

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch, 
> HDFS-14648.009.patch, HDFS-14648.010.patch, HDFS-14648.011.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-13 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973100#comment-16973100
 ] 

Hadoop QA commented on HDFS-14648:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
33s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
24s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 46s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
45s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
11s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
50s{color} | {color:green} hadoop-hdfs-project: The patch generated 0 new + 112 
unchanged - 1 fixed = 112 total (was 113) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 27s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
54s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m  
3s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 27s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
45s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}177m  8s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.TestRedudantBlocks |
|   | hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14648 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12985679/HDFS-14648.011.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 3a216e357809 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build 

[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-12 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973039#comment-16973039
 ] 

Hadoop QA commented on HDFS-14648:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
57s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
20s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 49s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
51s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
11s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
52s{color} | {color:green} hadoop-hdfs-project: The patch generated 0 new + 112 
unchanged - 1 fixed = 112 total (was 113) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 17s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  
2s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m  
1s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}115m 48s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}199m 32s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer |
|   | hadoop.hdfs.server.namenode.TestNamenodeCapacityReport |
|   | hadoop.hdfs.qjournal.server.TestJournalNodeSync |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14648 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12985668/HDFS-14648.010.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 22f81004446f 4.15.0-66-generic #75-Ubuntu SMP Tue 

[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-12 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973016#comment-16973016
 ] 

Lisheng Sun commented on HDFS-14648:


i updated the patch and uploaded the v011 patch. Could you mind continue to 
review it? Thank you. [~linyiqun]

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch, 
> HDFS-14648.009.patch, HDFS-14648.010.patch, HDFS-14648.011.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-12 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972994#comment-16972994
 ] 

Yiqun Lin commented on HDFS-14648:
--

[~leosun08], sorry for the confused, you are right. Please remove this change 
in DFSStripedInputStream and address other comments. Thanks.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch, 
> HDFS-14648.009.patch, HDFS-14648.010.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-12 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972990#comment-16972990
 ] 

Lisheng Sun commented on HDFS-14648:


hi [~linyiqun]
{quote}
DFSInputStream.java
I haven't seen the call dfsClient.addNodeToDeadNodeDetector added in method 
createBlockReader under this class
{quote}
i do not find the method DFSInputStream#createBlockReader. createBlockReader 
should be in DFSStripedInputStream.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch, 
> HDFS-14648.009.patch, HDFS-14648.010.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-12 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972980#comment-16972980
 ] 

Yiqun Lin commented on HDFS-14648:
--

The latest patch looks great, some more comments:

*ClientContext.java*
 We need a method to stop dead node detector thread and called this in 
DFSClient#close.
{code:java}
  /**
   * Close dead node detector thread.
   */
  public void stopDeadNodeDetectorThread() {
  if (deadNodeDetectorThr != null) {
  deadNodeDetectorThr.interrupt();
  try {
  deadNodeDetectorThr.join(3000);
  } catch (InterruptedException e) {
  LOG.warn("Encountered exception while waiting to join on dead 
node detector thread.", e);
  }
}
  }

.
  public synchronized void close() throws IOException {
if(clientRunning) {
  ...
  // close dead node detector thread
  clientContext.stopDeadNodeDetectorThread();
}
  }
{code}

 *DFSInputStream.java*
 I haven't seen the call {{dfsClient.addNodeToDeadNodeDetector}} added in 
method {{createBlockReader}} under this class.

 *DFSStripedInputStream.java*
 Can we remove dfsClient.addNodeToDeadNodeDetector in this class? It's not 
expected enable dead node detection in the EC mode.
{code:java}
   fetchBlockAt(block.getStartOffset());
-  addToDeadNodes(dnInfo.info);
+  addToLocalDeadNodes(dnInfo.info);
+  dfsClient.addNodeToDeadNodeDetector(this, dnInfo.info);   <=== be 
removed
 }
{code}

Can we also fix this whitespace warning?
{noformat}
./hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDeadNodeDetection.java:113:
  public void testDeadNodeDetectionInMultipleDFSInputStream() 
{noformat}
Others looks good to me now.
  

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch, 
> HDFS-14648.009.patch, HDFS-14648.010.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-12 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972735#comment-16972735
 ] 

Hadoop QA commented on HDFS-14648:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
47s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
11s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 35s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
40s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
11s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
51s{color} | {color:green} hadoop-hdfs-project: The patch generated 0 new + 112 
unchanged - 1 fixed = 112 total (was 113) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
40s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 24s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
39s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
51s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}106m 28s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}182m 49s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy 
|
|   | hadoop.hdfs.tools.TestDFSZKFailoverController |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14648 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12985638/HDFS-14648.009.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 40af05af70d8 4.15.0-66-generic 

[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-12 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972624#comment-16972624
 ] 

Lisheng Sun commented on HDFS-14648:


Thanks [~linyiqun] for good comments.
 i updated the patch as your comments and uploaded the v009 patch. Thank you.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch, 
> HDFS-14648.009.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-12 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972571#comment-16972571
 ] 

Yiqun Lin commented on HDFS-14648:
--

Thanks [~leosun08] , the patch almost looks good now, only some minor comments:

*DFSInputStream.java*
 1. There are one additional places we can add the addNodeToDeadNodeDetector 
call in {{createBlockReader}}
{code:java}
  boolean createBlockReader(LocatedBlock block, long offsetInBlock,
  LocatedBlock[] targetBlocks, BlockReaderInfo[] readerInfos,
 ...
} else {
  //TODO: handles connection issues
  DFSClient.LOG.warn("Failed to connect to " + dnInfo.addr + " for " +
  "block" + block.getBlock(), e);
  // re-fetch the block in case the block has been moved
  fetchBlockAt(block.getStartOffset());
  addToLocalDeadNodes(dnInfo.info);
 //  <
}
  }
{code}
*DeadNodeDetector.java*
 1.Can you address this comment that missed?
{quote}1. Can we comment the name as Client context name

+ /**
 + * Client context name.
 + */
 + private String name;
{quote}
2. We can use the containsKey to check
{code:java}
  public boolean isDeadNode(DatanodeInfo datanodeInfo) {
return deadNodes.containsKey((datanodeInfo.getDatanodeUuid());
  }
{code}
Also we can use the key to remove in method clearAndGetDetectedDeadNodes
{code:java}
for (DatanodeInfo datanodeInfo : deadNodes.values()) {
  if (!newDeadNodes.contains(datanodeInfo)) {
deadNodes.remove(datanodeInfo.getDatanodeUuid());
  }
}
{code}
3. We can periodically call clearAndGetDetectedDeadNodes to make deadNodes list 
be refreshed. I think deadNodes list can be a little staled when the local dead 
node is cleared in dfs input stream.
{code:java}
  public void run() {
while (true) {
  clearAndGetDetectedDeadNodes();
  LOG.debug("Current detector state {}, the detected nodes: {}.", state,
  deadNodes.values());
  switch (state) {
{code}
4. Not fully get this. Why we still call this in the latest patch? Can you 
explain for this?
{noformat}
newDeadNodes.retainAll(deadNodes.values());
{noformat}

*TestDFSClientDetectDeadNodes.java*
 1. Can you rename the unit test name from {{TestDFSClientDetectDeadNodes}} to 
{{TestDeadNodeDetection}}? And simplified the comment to this:
{noformat}
+/**
+ * Tests for dead node detection in DFSClient.
+ */
+public class TestDeadNodeDetection {
{noformat}
Two other name updated:
 * testDetectDeadNodeInBackground --> testDeadNodeDetectionInBackground
 * testDeadNodeMultipleDFSInputStream --> 
testDeadNodeDetectionInMultipleDFSInputStream

2. No needed to call {{ThreadUtil.sleepAtLeastIgnoreInterrupts(10 * 1000L);}} I 
think.
 3. Can we extract the DFSClient here? I see we call many times getDFSClient().
{code:java}
assertEquals(1, din1.getDFSClient().getDeadNodes(din1).size());
assertEquals(1, din1.getDFSClient().getClientContext()
{code}
 

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each 

[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-12 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972566#comment-16972566
 ] 

Hadoop QA commented on HDFS-14648:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
32s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
22s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
 4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 10s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
54s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
10s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 52s{color} | {color:orange} hadoop-hdfs-project: The patch generated 1 new + 
112 unchanged - 1 fixed = 113 total (was 113) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 11s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  2m 
13s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs-client generated 1 new 
+ 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m  
2s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 96m 54s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
41s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}171m 42s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs-client |
|  |  org.apache.hadoop.hdfs.protocol.DatanodeInfo is incompatible with 
expected argument type String in 
org.apache.hadoop.hdfs.DeadNodeDetector.clearAndGetDetectedDeadNodes()  At 
DeadNodeDetector.java:argument type String in 
org.apache.hadoop.hdfs.DeadNodeDetector.clearAndGetDetectedDeadNodes()  At 
DeadNodeDetector.java:[line 165] |
| Failed junit tests | hadoop.hdfs.server.namenode.TestReencryption |
|   | 

[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-12 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972358#comment-16972358
 ] 

Lisheng Sun commented on HDFS-14648:


{quote}
2) The line newDeadNodes.retainAll(deadNodes.values()); should not be correct, 
it will let newDeadNodes be same with old deadnodes.
{code:java}
+  public synchronized Set getDeadNodesToDetect() {
+// remove the dead nodes who doesn't have any inputstream first
+Set newDeadNodes = new HashSet();
+for (HashSet datanodeInfos : dfsInputStreamNodes.values()) {
+  newDeadNodes.addAll(datanodeInfos);
+}
+
+newDeadNodes.retainAll(deadNodes.values());
+
+for (DatanodeInfo datanodeInfo : deadNodes.values()) {
+  if (!newDeadNodes.contains(datanodeInfo)) {
+deadNodes.remove(datanodeInfo);
+  }
+}
+return newDeadNodes;
+  }

{code}
{quote}
Thanks [~linyiqun] for deep review comments.
Finally, newDeadNodes should be same with old deadnodes in 
DeadNodeDetector#clearAndGetDetectedDeadNodes.
And updated the patch and uploaded the v008 patch. Thank you a lot. [~linyiqun]

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch, HDFS-14648.007.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-11 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972036#comment-16972036
 ] 

Yiqun Lin commented on HDFS-14648:
--

Thanks [~leosun08] for updating the patch! It looks more readable now. Some 
deep review comments from me:

*ClientContext.java*
1.Can we use {{deadNodeDetectionEnabled}} to replace 
{{sharedDeadNodesEnabled}}? This name will keep consistent with 
DeadNodeDetector. Can you replace this in the whole patch, including in some 
method comments?

*DFSClient.java*
1. Why we add the deadnode from dfsInputstream again? Just to get the latest 
dead node that hasn't  been detected? DeadNodeDetector should already add the  
dfsInputstream dead node when the deadnode was detected.
{code}
  public ConcurrentHashMap getDeadNodes(
+  DFSInputStream dfsInputStream) {
+if (clientContext.isSharedDeadNodesEnabled()) {
+  ConcurrentHashMap deadNodes =
+  new ConcurrentHashMap();
+  if (dfsInputStream != null) {
+deadNodes.putAll(dfsInputStream.getLocalDeadNodes());
+  }
+
+  Set detectDeadNodes =
+  clientContext.getDeadNodeDetector().getDeadNodesToDetect();
+  for (DatanodeInfo detectDeadNode : detectDeadNodes) {
+deadNodes.put(detectDeadNode, detectDeadNode);
+  }
...
{code}
2. Can we redundant '{}'?
{code}
LOG.debug("DeadNode detection is not enabled or given block {} is null, " +
+  "skip to remove node {}.", locatedBlocks);
{code} 
to
{code}
LOG.debug("DeadNode detection is not enabled or given block {} is null, " +
+  "skip to remove node.", locatedBlocks);
{code} 
 
*DeadNodeDetector.java*
1. Can we comment the name as Client context name
{code}
+  /**
+   * Client context name.
+   */
+  private String name;
{code}
2.Can we use datanodeuuid get from DatanodeInfo as the key? We use the same 
type object here and looks confused. datanodeuuid is the identity of one DN 
node.
{code}
private final ConcurrentHashMap deadNodes;
{code}
3. I think it will be better to print out the detected dead node info here.
{code}
LOG.debug("Current detector state {}, the detected nodes: {}.",);
{code}
4. Two comments for this:
1) Update the method name to clearAndGetDetectedDeadNodes
2) The line newDeadNodes.retainAll(deadNodes.values()); should not be correct, 
it will let newDeadNodes be same with old deadnodes.
{code}
+  public synchronized Set getDeadNodesToDetect() {
+// remove the dead nodes who doesn't have any inputstream first
+Set newDeadNodes = new HashSet();
+for (HashSet datanodeInfos : dfsInputStreamNodes.values()) {
+  newDeadNodes.addAll(datanodeInfos);
+}
+
+newDeadNodes.retainAll(deadNodes.values());
+
+for (DatanodeInfo datanodeInfo : deadNodes.values()) {
+  if (!newDeadNodes.contains(datanodeInfo)) {
+deadNodes.remove(datanodeInfo);
+  }
+}
+return newDeadNodes;
+  }
{code}

*TestDFSClientDetectDeadNodes.java*
Comments for testDetectDeadNodeInBackground  and 
testDeadNodeMultipleDFSInputStream: 
1). Can we check the deadnode size again after closing the dfs input stream?
2). Can we use DeadDetector#getDeadNodesToDetect to check dead node as well?
{code}
FSDataInputStream in = fs.open(filePath);
+try {
+  try {
+in.read();
+  } catch (BlockMissingException e) {
+  }
+
+  DFSInputStream din = (DFSInputStream) in.getWrappedStream();
+  assertEquals(3, din.getDFSClient().getDeadNodes(din).size());
   // use DeadDetector to get dead node as well
+} finally {
+  in.close();
+  fs.delete(new Path("/testDetectDeadNodeInBackground"),
+  true);
  // check the dead node again here, the dead node is expected be removed
+}
{code}
3. Can we check the dead node detail info as well, like dn uuid?
{code}
DFSInputStream din2 = (DFSInputStream) in1.getWrappedStream();
+  assertEquals(1, din1.getDFSClient().getDeadNodes(din1).size());
+  assertEquals(1, din2.getDFSClient().getDeadNodes(din2).size());
//  check the dn uuid of dead node to see if its expected dead node
{code}


> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch, HDFS-14648.007.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. 

[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-10 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971324#comment-16971324
 ] 

Hadoop QA commented on HDFS-14648:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
47s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
19s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
 6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 30s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
41s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
11s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m  
9s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 50s{color} | {color:orange} hadoop-hdfs-project: The patch generated 2 new + 
112 unchanged - 1 fixed = 114 total (was 113) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 30s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
38s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
52s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}108m 18s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
46s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}184m 58s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14648 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12985481/HDFS-14648.007.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 338d8e78a090 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 

[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-10 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971280#comment-16971280
 ] 

Lisheng Sun commented on HDFS-14648:


Thanks [~linyiqun] for deeply review and good commnets.

I updated the patch according to your comemnts and uploaed the v007 patch. 
Could you help take a review to it? Thank you a lot.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch, HDFS-14648.007.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-11-09 Thread Yiqun Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971007#comment-16971007
 ] 

Yiqun Lin commented on HDFS-14648:
--

The patch overall looks good, but some places seem not readable. I'd like to 
give some minor comments to improve this firstly. Will give further review 
comments since I am still in reviewing.

*ClientContext.java*
 1.'Detect the dead datanodes n advance', should be 'Detect the dead datanodes 
in advance,'
 2.Can we add the comment for these two methods since they frequently be called 
in other places?

*DFSClient.java*
 1.Simplified the comment for method ConcurrentHashMap getDeadNodes(DFSInputStream dfsInputStream)
{noformat}
+  /**
+   * If sharedDeadNodesEnabled is true, return the dead nodes are detected by
+   * all the DFSInputStreams in the same client. Otherwise return the dead 
nodes
+   * are detected by this DFSInputStream.
+   */
{noformat}
to
{noformat}
+  /**
+   * If sharedDeadNodesEnabled is true, return the dead nodes that detected by
+   * all the DFSInputStreams in the same client. Otherwise return the dead 
nodes
+   * that detected by given DFSInputStream.
+   */
{noformat}
2.Simplified the comment for method isDeadNode(DFSInputStream dfsInputStream, 
DatanodeInfo datanodeInfo)
{noformat}
+  /**
+   * If sharedDeadNodesEnabled is true, judgement based on whether this 
datanode
+   * is included or not in DeadNodeDetector#deadnodes. Otherwise judgment based
+   * on whether it is included or not in DFSInputStream#deadnodes.
+   */
{noformat}
to
{noformat}
+  /**
+   * If sharedDeadNodesEnabled is true, judgement based on whether this 
datanode
+   * is included or not in DeadNodeDetector. Otherwise judgment based given
+   * DFSInputStream.
+   */
{noformat}
3. It will be better to add one additional log here and update the method name.
{code:java}
 
+  /**
+   * Add given datanode in DeadNodeDetector.
+   */
public void addNodeToDeadNodeDetector(DFSInputStream dfsInputStream,
+  DatanodeInfo datanodeInfo) {
+if (!isSharedDeadNodesEnabled()) {
LOG.debug("DeadNode detection is not enabled, skip to add node {}.", 
datanodeInfo);
+  return;
+}
+clientContext.getDeadNodeDetector().addNodeToDetect(dfsInputStream,
+datanodeInfo);
+  }
{code}
4. The similar change for method removeNodeFromDetectByDFSInputStream
{code:java}
+  /**
+   * Remove given datanode from DeadNodeDetector.
+   */
+  public void removeNodeFromDeadNodeDetector(
+  DFSInputStream dfsInputStream, DatanodeInfo datanodeInfo) {
+if (!isSharedDeadNodesEnabled()) {
LOG.debug("DeadNode detection is not enabled, skip to remove node {}.", 
datanodeInfo);
+  return;
+}
+clientContext.getDeadNodeDetector()
+.removeNodeFromDetectByDFSInputStream(dfsInputStream, datanodeInfo);
+  }
{code}
5. Update for removeNodeFromDeadNodeDetector
{code:java}
   /**
+   * Remove datanodes that given block placed on from DeadNodeDetector.
+   */
+  public void removeNodeFromDetectByDFSInputStream(
+  DFSInputStream dfsInputStream, LocatedBlocks locatedBlocks) {
+if (!isSharedDeadNodesEnabled() || locatedBlocks == null) {
LOG.debug("DeadNode detection is not enabled or given block is null, 
skip to remove node.",);
+  return;
+}
+for (LocatedBlock locatedBlock : locatedBlocks.getLocatedBlocks()) {
+  for (DatanodeInfo datanodeInfo : locatedBlock.getLocations()) {
+removeNodeFromDetectByDFSInputStream(dfsInputStream, datanodeInfo);
+  }
+}
+  }
{code}
 * DFSInputStream.java*
 1. Update method name getDfsClient to getDFSClient
{code:java}
+  public DFSClient getDFSClient() {
+return dfsClient;
+  } 
{code}
2. Can we reduce the visibility of these methods? I don't think all of them 
need to be public, protected or private should be enough.
{code:java}
+  public void removeFromLocalDeadNodes(DatanodeInfo dnInfo) {
+deadNodes.remove(dnInfo);
+  }
+
+  public ConcurrentHashMap getLocalDeadNodes() {
+return deadNodes;
+  }
+
+  public void clearLocalDeadNodes() {
+deadNodes.clear();
+  }
+
+  public DFSClient getDfsClient() {
+return dfsClient;
+  }
{code}

 * DeadNodeDetector.java*
 1. Can we add the comment for the sleep time and name?
{code:java}
+  private static final long ERROR_SLEEP_MS = 5000;
+  private static final long IDLE_SLEEP_MS = 1;
+
+  private String name;
{code}
2. Update the log to use the parameter way and 'start' should be 'Start'.
{code:java}
LOG.info("start dead node detector for DFSClient " + this.name);
{code}
to
{code:java}
LOG.info("Start dead node detector for DFSClient {}.", name);
{code}
3.Update the state log
{code:java}
LOG.debug("state " + state);
{code}
to
{code:java}
LOG.debug("Current detector state {}.", state);
{code}

 * StripeReader.java*
 The class doesn't need to do any change since we don't make name change for 
method getDFSClient.

 * 

[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-09-04 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922317#comment-16922317
 ] 

Lisheng Sun commented on HDFS-14648:


Detailed DeadNodeDetector desgin documents attached HDFS-13571.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, 
> HDFS-14648.006.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-31 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920062#comment-16920062
 ] 

Hadoop QA commented on HDFS-14648:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
51s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 46s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
29s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
41s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 44s{color} | {color:orange} hadoop-hdfs-project: The patch generated 1 new + 
127 unchanged - 1 fixed = 128 total (was 128) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 40s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
28s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
58s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 96m 24s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
36s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}173m 33s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDFSStorageStateRecovery |
|   | hadoop.hdfs.server.balancer.TestBalancerRPCDelay |
|   | hadoop.hdfs.TestErasureCodingPolicies |
|   | hadoop.hdfs.TestDecommissionWithStriped |
|   | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
|   | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
|   | hadoop.hdfs.qjournal.client.TestQJMWithFaults |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | HDFS-14648 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12979034/HDFS-14648.006.patch 

[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-29 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918855#comment-16918855
 ] 

Hadoop QA commented on HDFS-14648:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
46s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
11s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 54s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
27s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
53s{color} | {color:green} hadoop-hdfs-project: The patch generated 0 new + 127 
unchanged - 1 fixed = 127 total (was 128) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
56s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 7 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 31s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
17s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
26s{color} | {color:red} hadoop-hdfs-project_hadoop-hdfs-client generated 1 new 
+ 0 unchanged - 0 fixed = 1 total (was 0) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
47s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}112m 41s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
42s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}184m 42s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestPread |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=18.09.7 Server=18.09.7 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | HDFS-14648 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12978902/HDFS-14648.005.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 867e469805f8 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 

[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-29 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918707#comment-16918707
 ] 

Lisheng Sun commented on HDFS-14648:


Thank [~hexiaoqiao] for deep reivew and good suggestion.
{quote}
4. about DeadNodeDetector, I do not get why set different STATEs for 
`DeadNodeDetector`, will be used by the following implements?
{quote}
 That set different STATEs for `DeadNodeDetector` is when DeadNodeDetector is 
in different state, it does different  in future implementations. E.g when 
DeadNodeDetector is CHECK_DEAD state, it can detect dead node in 
DeadNodeDetector#deadnode.
It is also eaiser for future expansion.
{quote}
7. IIUC, some nodes are detected as dead, it should not be in next pipeline, 
right? but I do not see anywhere to add this deadNodes to `excludeNodes`
{quote}
If datanode is consider as deadnode, it is placed in DFSInputsteam#deadnode and 
DeadNodeDetector#deadnode.  Other dfsintputSream should no longer access this 
datanode.

Updatd this patch and uploaded the v005 patch. Could you mind continue to take 
a review for this patch? Thank [~hexiaoqiao] again.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-25 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915281#comment-16915281
 ] 

Lisheng Sun commented on HDFS-14648:


Thank [~zhangchen] for good suggestion. I will upload the a design doc which 
describes this patch in detail later.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915262#comment-16915262
 ] 

He Xiaoqiao commented on HDFS-14648:


Thanks [~leosun08] for your contributions. I try to learn 
[^HDFS-14648.004.patch] with some minor comments,
1. Some codestyles reported by Jenkins above, Please take a look.
2. It is better to locate same module configuration together in 
hdfs-default.xml to find conveniently relevant items,  just suggest to define 
`dfs.client.deadnode.detect.enabled` together with other `dfs.client.*` items.
3. Some method name is open to different interpretations. for instance, 
`DeadNodeDetector#removeNodeFromDetect`, my first impression is that remove 
node from `deadNodes` shared by all DFSInputStreams in the same DFSClient, 
however it just mean remove this node from `localNodes` which just visible to 
some one DFSInputStream actually. right?
4. about DeadNodeDetector, I do not get why set different STATEs for 
`DeadNodeDetector`, will be used by the following implements?
5. about DeadNodeDetector#run, does it need to catch InterruptedException out 
of while loop and return?
6. Some constant such as '5000'/'1' but not explain why, I think we should 
define this constant at the begin of Class and add some annotation.
7. IIUC, some nodes are detected as dead, it should not be in next pipeline, 
right? but I do not see anywhere to add this deadNodes to `excludeNodes`.
Thanks [~leosun08] again.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-24 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915165#comment-16915165
 ] 

Chen Zhang commented on HDFS-14648:
---

[~leosun08], since this is a big patch, I think it would be better to upload a 
design doc with detailed description and flowchart or graphs for other 
reviewers to understand the patch easily, which can accelerate the review 
progress.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-24 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915143#comment-16915143
 ] 

Lisheng Sun commented on HDFS-14648:


Thank [~jojochuang] for your good suggetions. I updated the descriptions and 
introducd the function and its implementation in more detail.
I have shared this feature with the MeiTuan, JingDong HDFS Team. They are 
looking forward to this feature and hope to use it sooner. Thank [~jojochuang] 
again.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-23 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914760#comment-16914760
 ] 

Duo Zhang commented on HDFS-14648:
--

We have been using this in our production for a long time. It does solve a big 
problem for us in HBase.

You know, for performance, HBase will open a dfs input stream and never close 
it unless the file has been compacted away. So when a DN is broken, every dfs 
input stream needs to find out the broken DN by its own, since every dfs input 
stream manages its own live/dead nodes.

If it is just a process crash, HBase will be fine, since when we touch the dead 
DN, we will receive a connection refused immediately and then go to other DNs. 
But if the machine is completely down, we will hang there for a long time and 
finally receive a connection timeout. Usually the connection timeout will be a 
value which is a bit large(15 seconds in our deploy), as there is no way to set 
the value per request so we have to find a value which is greater than most of 
the timeout values from HBase requests. 

This is really a big problem for us. For a 300+ nodes cluster, a machine 
failure will make the availability down for more than 2 hours!

So I think this is really a useful feature for HBase.

Thanks.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # After DFSInputstream detects some DataNode die, it put in DeadNodeDetector 
> and share this information to others in the same DFSClient. The ohter 
> DFSInputstreams will not read this DataNode.
>  # DeadNodeDetector also have DFSInputstream reference relationships to each 
> DataNode. When DFSInputstream close, DeadNodeDetector also remove this 
> reference. If some DeadNode of DeadNodeDetector is not read by 
> DFSInputstream, it also is removed from DeadNodeDetector.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-23 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914479#comment-16914479
 ] 

Wei-Chiu Chuang commented on HDFS-14648:


[~leosun08] thanks for filing the jira. This appears like a pretty big patch 
and touches a lot of internals in DFS client.
Please expect longer turn-around time.

Additionally, because the patch itself is non-trivial, let's find a better way 
to drive this forward.

Instead of writing a few lines of high level descriptions, and pinging people 
on the jira,
(1) would you please update the description to fill in the implementation 
details?
(2) any thing worth noting? any new configurations? any thing that is not 
supported/considered?
(3) has this been tested in production? can this be turned off?

Would some one like to sign up as the reviewer for this patch?

Historically, HDFS developers don't invest much HDFS client implementation. It 
feels to me HBase community would be more interested. [~busbey] [~Apache9] do 
you know who might be able to provide valuable feedback for this patch? Thanks.

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # After DFSInputstream detects some DataNode die, it put in DeadNodeDetector 
> and share this information to others in the same DFSClient. The ohter 
> DFSInputstreams will not read this DataNode.
>  # DeadNodeDetector also have DFSInputstream reference relationships to each 
> DataNode. When DFSInputstream close, DeadNodeDetector also remove this 
> reference. If some DeadNode of DeadNodeDetector is not read by 
> DFSInputstream, it also is removed from DeadNodeDetector.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org