[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2022-05-20 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540321#comment-17540321
 ] 

Michael Stack commented on HDFS-14997:
--

Linking an issue where the change in how we do command processing uncovers an 
old, existing problem. Perhaps it is of interest.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.3, 3.2.4
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418516#comment-17418516
 ] 

Xiaoqiao He commented on HDFS-14997:


cherry-pick to branch-3.2 and branch-3.2.3.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.3, 3.2.4
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416722#comment-17416722
 ] 

Xiaoqiao He commented on HDFS-14997:


[~brahmareddy] This patch is ready now, less conflict and verify at local, I 
just check in for branch-3.2. If no other objection, I will check in this 
related patch to branch-3.2.3 also. What do you think? Thanks.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.4
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416630#comment-17416630
 ] 

Xiaoqiao He commented on HDFS-14997:


Failed unit tests seems not related to this changes. Will commit to branch-3.2 
for a short while.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-15 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415390#comment-17415390
 ] 

Hadoop QA commented on HDFS-14997:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 
54s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 
40s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
46s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
10s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 48s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
4s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 19m  
2s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  3m  
8s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
14s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
4s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
42s{color} | {color:green}{color} | {color:green} 
hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 114 unchanged - 4 
fixed = 114 total (was 118) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
8s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  0s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  3m 
10s{color} | {color:green}{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} || ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}201m 22s{color} 
| 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/715/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt{color}
 | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
55s{color} | {color:green}{color} | {color:green} The patch does not generate 
ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}285m 25s{color} | 
{color:black}{color} | {color:black}{color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestFileChecksum |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailureWithRandomECPolicy |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure |
|   | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
|   | hadoop.hdfs.server.namenode.ha.TestRetryCacheWith

[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-15 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415380#comment-17415380
 ] 

Hadoop QA commented on HDFS-14997:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  8m 
50s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 
11s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
47s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
11s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 58s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
4s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 17m 
46s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  2m 
44s{color} | {color:green}{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 1s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
52s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
52s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green}{color} | {color:green} 
hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 114 unchanged - 4 
fixed = 114 total (was 118) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
2s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  2s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  2m 
48s{color} | {color:green}{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} || ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}193m  0s{color} 
| 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/716/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt{color}
 | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
42s{color} | {color:green}{color} | {color:green} The patch does not generate 
ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}270m  3s{color} | 
{color:black}{color} | {color:black}{color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
|   | hadoop.hdfs.TestFileChecksum |
|   | hadoop.hdfs.TestReconstructStripedFileWithValidator |
|   | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailur

[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2020-09-27 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203015#comment-17203015
 ] 

Xiaoqiao He commented on HDFS-14997:


[~Captainhzy][~sodonnell] Thanks for your comments. 
BPOfferService#mReadWriteLock is indeed another competition point. In my 
practice, processes commands asynchronously could mitigate most of datanode 
issues, but not complete. If any datanode traces share will help to dig this 
case further more.
{quote}I have an idea. It can put the `updateActorStatesFromHeartbeat` function 
to `CommandProcessingThread`. In this case, it will not block heartbeat due to 
`writeLock`.{quote}
Would you like to file another JIRA and try to submit patch? Thanks.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2020-09-10 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193490#comment-17193490
 ] 

Stephen O'Donnell commented on HDFS-14997:
--

[~Captainhzy] Do you have this patch included on your cluster and you see the 
heartbeat thread blocked for a long time?

What type of command is holding the lock? Do you have any jstack traces showing 
the problem?

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2020-09-09 Thread zy.jordan (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193381#comment-17193381
 ] 

zy.jordan commented on HDFS-14997:
--

Hi, [~hexiaoqiao]. I have encounter this problem in a very heavy cmd.

I have an idea. It can put the `updateActorStatesFromHeartbeat` function to 
`CommandProcessingThread`. In this case, it will not block heartbeat due to 
`writeLock`.

Do you think it will do effort?

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2020-08-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184920#comment-17184920
 ] 

Xiaoqiao He commented on HDFS-14997:


Thanks [~Captainhzy], In my experience, process cmds async can resolve most 
cases about DataNode lost because no heartbeat for long times from NameNode 
views.
In general, we have split cmds to process one by one asyc rather than process 
all of them in main flow. So I believe it improved significantly.
About the lock contention, I agree that it could be blocked especially process 
one very heavy cmd(maybe some corner case). Any ideas to improve it? welcome to 
more discuss. Thanks again.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2020-08-25 Thread zy.jordan (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184873#comment-17184873
 ] 

zy.jordan commented on HDFS-14997:
--

Thanks for [~hexiaoqiao]'s answer. 

Due to `updateActorStatesFromHeartbeat` function in the 
`BPServiceActor.offerService` need to hold write lock. And 
`processCommandFromActive` function in the `CommandProcessingThread` also need 
to hold the same write lock. So when 'CommandProcessingThread` work, it also 
block the `BPServiceActor`, and block the heartbeat.

This is my view, it might not be right. If so, please correct it, and thanks a 
lot. 

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2020-08-20 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181643#comment-17181643
 ] 

Xiaoqiao He commented on HDFS-14997:


Thanks [~Captainhzy] for your comments. This improvement try to process some 
high cost cmds async, and avoid to block the core flow of BPServiceActor, such 
as `DNA_INVALIDATE`.
As for your concerns, It could be not issue IMO, because 
`updateActorStatesFromHeartbeat` is very light operation, no need to async. 
Another side, we need to update namenode state at real time to avoid to execute 
some cmds from Standby. Hope it could answer your questions. Thanks.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2020-08-20 Thread zy.jordan (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181563#comment-17181563
 ] 

zy.jordan commented on HDFS-14997:
--

Hi ,[~hexiaoqiao]

>From this code, it looks like run process cmd async instead of in heartbeat 
>thread. It looks like perfect.

But look insight, I found heartbeat thread run 
bpos.updateActorStatesFromHeartbeat() func, and updateActorStatesFromHeartbeat 
held write lock()
{code:java}
private void offerService() throws Exception {
  ...
  while (shouldRun()) {
...
bpos.updateActorStatesFromHeartbeat(this, resp.getNameNodeHaState());
...
  }
  ...
}{code}
 
{code:java}
void updateActorStatesFromHeartbeat( BPServiceActor actor, NNHAStatusHeartbeat 
nnHaState) {
  writeLock();
  try {
...
  } finally {
writeUnlock();
  }
  ...
}
{code}
And in async process cmd thread, proces cmd func also held the same lock
{code:java}
boolean processCommandFromActor(DatanodeCommand cmd, BPServiceActor actor) 
throws IOException {
  ...
  writeLock();
  try {
...
  } finally {
writeUnlock();
  }
}
{code}
On my view, the process cmd looks like not async

 

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003874#comment-17003874
 ] 

Hudson commented on HDFS-14997:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17794 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17794/])
HDFS-14997. Addendum: BPServiceActor processes commands from NameNode 
(ayushsaxena: rev 80f91d14ab0fb385252d4eeb19141bd059303d59)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java


> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003868#comment-17003868
 ] 

Xiaoqiao He commented on HDFS-14997:


Thanks [~ayushtkn],[~elgoiri],[~Kevin_Zheng] for your helps and reviews.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003863#comment-17003863
 ] 

Ayush Saxena commented on HDFS-14997:
-

Committed Addendum to trunk.
Thanx [~hexiaoqiao] for the contribution, [~elgoiri] and [~Kevin_Zheng] for the 
reviews!!!

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003860#comment-17003860
 ] 

Ayush Saxena commented on HDFS-14997:
-

+1 for the addendum

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Zhenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003827#comment-17003827
 ] 

Zhenyu Zheng commented on HDFS-14997:
-

[~hexiaoqiao] Thanks for the fix, manually patched and tested, it works. so +1

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003776#comment-17003776
 ] 

Íñigo Goiri commented on HDFS-14997:


The addendum looks good to me.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003613#comment-17003613
 ] 

Xiaoqiao He commented on HDFS-14997:


[~elgoiri] Please ignore failed unit tests reported by QA in the 
second-to-last, since I submit wrong addendum patch at the beginning. Please 
reference the last one. Thanks.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003609#comment-17003609
 ] 

Hadoop QA commented on HDFS-14997:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
30s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 24s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
15s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 21s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 57s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
38s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}151m 27s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.TestRedudantBlocks |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-14997 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12989487/HDFS-14997.addendum.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux e72cf1da6758 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 300505c |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28574/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28574/testReport/ |
| Max. process+thread count | 3843 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console o

[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003607#comment-17003607
 ] 

Íñigo Goiri commented on HDFS-14997:


I think the failed unit tests are related as we are interrupting the threads 
twice.
We should probably also set it to null once we are done in addition to checking 
if it was interrupted. 

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003578#comment-17003578
 ] 

Hadoop QA commented on HDFS-14997:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
53s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  3s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
15s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red} 13m 
31s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
17s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 38m 21s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}102m 35s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestSetrepDecreasing |
|   | hadoop.hdfs.server.blockmanagement.TestNameNodePrunesMissingStorages |
|   | hadoop.hdfs.server.datanode.TestNNHandlesCombinedBlockReport |
|   | hadoop.fs.TestResolveHdfsSymlink |
|   | hadoop.hdfs.TestClose |
|   | hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency |
|   | hadoop.fs.viewfs.TestViewFsDefaultValue |
|   | hadoop.hdfs.TestWriteBlockGetsBlockLengthHint |
|   | hadoop.hdfs.TestDatanodeConfig |
|   | hadoop.hdfs.tools.TestECAdmin |
|   | hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics |
|   | 
hadoop.hdfs.server.namenode.snapshot.TestSnapshotNameWithInvalidCharacters |
|   | hadoop.hdfs.TestFileChecksum |
|   | hadoop.hdfs.server.namenode.TestXAttrConfigFlag |
|   | hadoop.hdfs.web.TestWebHdfsTokens |
|   | hadoop.hdfs.security.TestDelegationTokenForProxyUser |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
|   | hadoop.hdfs.server.namenode.TestQuotaByStorageType |
|   | hadoop.hdfs.TestDFSFinalize |
|   | hadoop.hdfs.TestDFSStripedInputStream |
|   | 
hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewerForStoragePolicy |
|   | hadoop.hdfs.se

[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003574#comment-17003574
 ] 

Xiaoqiao He commented on HDFS-14997:


run {{TestFileChecksum}} at local with addendum patch, try to dump memory 
again, The number of DataNode instances and commandProcessingThread are both 
restore. Please help to double check.[~ayushtkn],[~Kevin_Zheng]. Thanks again.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003535#comment-17003535
 ] 

Xiaoqiao He commented on HDFS-14997:


Thanks [~ayushtkn],[~Kevin_Zheng] for your reminder and very valuable comments, 
very sorry for not check this logic very carefully. Try to dump memory and 
analyze who occupy, I observe that there are over 200 DataNode object actively 
and each one occupy over 600K retained heap size when run {{TestFileChecksum}}, 
and {{commandProcessingThread}} thread not interrupt while stop DataNode. In 
one word, it is easy to meet OutOfMemoryError if there are many test methods in 
single test class and setup MiniCluster with lots of DataNode.
[^HDFS-14997.addendum.patch] try to fix this issue, Please take reviews if have 
time. cc [~elgoiri],[~weichiu]

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Zhenyu Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003518#comment-17003518
 ] 

Zhenyu Zheng commented on HDFS-14997:
-

Hi [~hexiaoqiao] here is the hprof we got from our tests, for 
{{TestFileChecksum}}  :

!image-2019-12-26-16-08-53-472.png! \

 

And as you can see from our CI: 
[https://builds.apache.org/job/Hadoop-qbt-linux-ARM-trunk/] we are running full 
tests once per day and these tests starts to fail serveral days ago.

I've also triggered one test for the upstream CI, it also fails for Datanode 
related tests:

[https://builds.apache.org/view/H-L/view/Hadoop/job/hadoop-multibranch/view/change-requests/job/PR-1783/1/#showFailuresLink]
 

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003507#comment-17003507
 ] 

Xiaoqiao He commented on HDFS-14997:


Thanks [~ayushtkn] for your reminder. I just check it and run 
{{TestFileChecksum}} at local times but not reproduce. Would like to offer some 
more information about hprof report since Jenkins log have less information to 
identify the problem. Please help to figure out link or log if I missing some 
important info. Thanks again.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003471#comment-17003471
 ] 

Ayush Saxena commented on HDFS-14997:
-

Hi [~hexiaoqiao] [~elgoiri]

Seems there are couple of tests failing due to heap issue. According to the 
hprof file, 60 percent of the memory is being occupied by the Datanode Object. 
Can you give a check, I suspect related.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000249#comment-17000249
 ] 

Hudson commented on HDFS-14997:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17778 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17778/])
HDFS-14997. BPServiceActor processes commands from NameNode (inigoiri: rev 
b86895485d95440de099831e0db38db037f16bdd)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/metrics/DataNodeMetrics.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBPOfferService.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java


> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-19 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000227#comment-17000227
 ] 

Íñigo Goiri commented on HDFS-14997:


Thanks [~hexiaoqiao] for the improvement and [~weichiu] for the review.
Committed to trunk.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch
>
>
> There are two core functions, report(#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand in #BPServiceActor main process flow. If 
> processCommand cost long time it will block send report flow. Meanwhile 
> processCommand could cost long time(over 1000s the worst case I meet) when IO 
> load  of DataNode is very high. Since some IO operations are under 
> #datasetLock, So it has to wait to acquire #datasetLock long time when 
> process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat 
> will not send to NameNode in-time, and trigger other disasters.
> I propose to improve #processCommand asynchronously and not block 
> #BPServiceActor to send heartbeat back to NameNode when meet high IO load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches are 
> not support this feature.
> 2. IO operations under #datasetLock is another issue, I think we should solve 
> it at another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org