[jira] [Commented] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2020-03-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066416#comment-17066416
 ] 

Hadoop QA commented on HDFS-15082:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
37s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 32m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
22m 18s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  3s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  7m 33s{color} 
| {color:red} hadoop-hdfs-rbf in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 89m 53s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.federation.router.TestRouterFaultTolerant |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 |
| JIRA Issue | HDFS-15082 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12997621/HDFS-15082.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux e357f4a9735b 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / d353b30 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29019/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29019/testReport/ |
| Max. process+thread count | 3178 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29019/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |

[jira] [Commented] (HDFS-15154) Allow only hdfs superusers the ability to assign HDFS storage policies

2020-03-24 Thread Siddharth Wagle (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066391#comment-17066391
 ] 

Siddharth Wagle commented on HDFS-15154:


Thanks [~arp] for your feedback. 
[~ayushtkn] I have updated version 15 with the minor change of replacing most of 
the CaseUtils calls with string literals for the operation name.

> Allow only hdfs superusers the ability to assign HDFS storage policies
> --
>
> Key: HDFS-15154
> URL: https://issues.apache.org/jira/browse/HDFS-15154
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.0.0
>Reporter: Bob Cauthen
>Assignee: Siddharth Wagle
>Priority: Major
> Attachments: HDFS-15154.01.patch, HDFS-15154.02.patch, 
> HDFS-15154.03.patch, HDFS-15154.04.patch, HDFS-15154.05.patch, 
> HDFS-15154.06.patch, HDFS-15154.07.patch, HDFS-15154.08.patch, 
> HDFS-15154.09.patch, HDFS-15154.10.patch, HDFS-15154.11.patch, 
> HDFS-15154.12.patch, HDFS-15154.13.patch, HDFS-15154.14.patch, 
> HDFS-15154.15.patch
>
>
> Please provide a way to allow only HDFS superusers to assign HDFS 
> Storage Policies to HDFS directories.
> Currently, based on Jira HDFS-7093, all storage policies can be disabled 
> cluster wide by setting the following:
> dfs.storage.policy.enabled to false
> But we need a way to allow only HDFS superusers to assign an HDFS 
> Storage Policy to an HDFS directory.
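A minimal sketch of the kind of gate being requested (illustrative only, with an assumed superuser-only flag and a caller-supplied superuser check; this is not the attached patch):
{code}
// Hedged sketch: reject storage-policy assignment from non-superusers when a
// (hypothetical) superuser-only flag is enabled.
public class StoragePolicyPermissionCheck {
  private final boolean superuserOnlyPolicies;

  public StoragePolicyPermissionCheck(boolean superuserOnlyPolicies) {
    this.superuserOnlyPolicies = superuserOnlyPolicies;
  }

  /** Throws if a non-superuser tries to set a storage policy while the flag is on. */
  public void checkCanSetStoragePolicy(boolean callerIsSuperuser) {
    if (superuserOnlyPolicies && !callerIsSuperuser) {
      throw new SecurityException("Only HDFS superusers may assign storage policies");
    }
  }
}
{code}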



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15154) Allow only hdfs superusers the ability to assign HDFS storage policies

2020-03-24 Thread Siddharth Wagle (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Wagle updated HDFS-15154:
---
Attachment: HDFS-15154.15.patch

> Allow only hdfs superusers the ability to assign HDFS storage policies
> --
>
> Key: HDFS-15154
> URL: https://issues.apache.org/jira/browse/HDFS-15154
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.0.0
>Reporter: Bob Cauthen
>Assignee: Siddharth Wagle
>Priority: Major
> Attachments: HDFS-15154.01.patch, HDFS-15154.02.patch, 
> HDFS-15154.03.patch, HDFS-15154.04.patch, HDFS-15154.05.patch, 
> HDFS-15154.06.patch, HDFS-15154.07.patch, HDFS-15154.08.patch, 
> HDFS-15154.09.patch, HDFS-15154.10.patch, HDFS-15154.11.patch, 
> HDFS-15154.12.patch, HDFS-15154.13.patch, HDFS-15154.14.patch, 
> HDFS-15154.15.patch
>
>
> Please provide a way to allow only HDFS superusers to assign HDFS 
> Storage Policies to HDFS directories.
> Currently, based on Jira HDFS-7093, all storage policies can be disabled 
> cluster wide by setting the following:
> dfs.storage.policy.enabled to false
> But we need a way to allow only HDFS superusers to assign an HDFS 
> Storage Policy to an HDFS directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2020-03-24 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066381#comment-17066381
 ] 

Xiaoqiao He commented on HDFS-15082:


Thanks [~elgoiri] for picking up this issue. Submitting v003 (identical to v002) 
to trigger Jenkins again.

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch, HDFS-15082.002.patch, 
> HDFS-15082.003.patch
>
>
> When adding or updating a mount entry, a component of the destination path could 
> exceed the filesystem path component length limit, controlled by 
> `dfs.namenode.fs-limits.max-component-length` on the NameNode. So we should check 
> the length of each component of the destination path on the Router side when 
> adding or updating a mount entry.
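As a rough illustration of the check described above (a sketch under assumed names, not the attached patch), each component of the destination can be validated against the NameNode's limit, which defaults to 255:
{code}
// Hedged sketch: validate every component of a mount destination against
// dfs.namenode.fs-limits.max-component-length (255 by default).
public class MountDestinationCheck {
  static void checkComponentLengths(String destination, int maxComponentLength) {
    for (String component : destination.split("/")) {
      if (component.length() > maxComponentLength) {
        throw new IllegalArgumentException("Component '" + component + "' of "
            + destination + " exceeds " + maxComponentLength + " characters");
      }
    }
  }

  public static void main(String[] args) {
    checkComponentLengths("/ns0/projects/data", 255); // passes silently
  }
}
{code}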



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2020-03-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15082:
---
Attachment: HDFS-15082.003.patch

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch, HDFS-15082.002.patch, 
> HDFS-15082.003.patch
>
>
> When adding or updating a mount entry, a component of the destination path could 
> exceed the filesystem path component length limit, controlled by 
> `dfs.namenode.fs-limits.max-component-length` on the NameNode. So we should check 
> the length of each component of the destination path on the Router side when 
> adding or updating a mount entry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI

2020-03-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066362#comment-17066362
 ] 

Hadoop QA commented on HDFS-13470:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
45s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
34m  1s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 36 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 16s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 50m 35s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 |
| JIRA Issue | HDFS-13470 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12997615/HDFS-13470.001.patch |
| Optional Tests |  dupname  asflicense  shadedclient  |
| uname | Linux fd9b788c30d9 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / d353b30 |
| maven | version: Apache Maven 3.3.9 |
| whitespace | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29018/artifact/out/whitespace-tabs.txt
 |
| Max. process+thread count | 305 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29018/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> RBF: Add Browse the Filesystem button to the UI
> ---
>
> Key: HDFS-13470
> URL: https://issues.apache.org/jira/browse/HDFS-13470
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch
>
>
> After HDFS-12512 added WebHDFS, we can add the support to browse the 
> filesystem to the UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066353#comment-17066353
 ] 

Íñigo Goiri commented on HDFS-15238:


There is a typo in "Cachec HA protocol." in [^HDFS-15238-002.patch].
I don't think there is a need for tests as this is an optimization and is 
covered already by other tests.
Other than the typo, this is ready to go.

> RBF:NamenodeHeartbeatService caused memory to grow rapidly
> --
>
> Key: HDFS-15238
> URL: https://issues.apache.org/jira/browse/HDFS-15238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-15238-002.patch, HDFS-15238-trunk-001.patch
>
>
> NamenodeHeartbeatService gets the NameNode's HA status every 5s and creates a new 
> HAServiceProtocol every time.
> When creating the HAServiceProtocol, it also creates a new Configuration.
> Over time, more and more REGISTER entries accumulate in the Configuration until a 
> full GC happens.
> The entries then pile up again and, after reaching a certain threshold, another 
> full GC is triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15239) Add button to go to the parent directory in the explorer

2020-03-24 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-15239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-15239:
---
Attachment: screenshot-1.png

> Add button to go to the parent directory in the explorer
> 
>
> Key: HDFS-15239
> URL: https://issues.apache.org/jira/browse/HDFS-15239
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Currently, when using the HDFS explorer page, it is easy to go into a folder.
> However, to go back one has to use the browser back button (if one is coming 
> from that folder) or to edit the path by hand.
> It would be nice to have the typical button to go to the parent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15239) Add button to go to the parent directory in the explorer

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066349#comment-17066349
 ] 

Íñigo Goiri commented on HDFS-15239:


Currently it looks like this:
 !screenshot-1.png! 

> Add button to go to the parent directory in the explorer
> 
>
> Key: HDFS-15239
> URL: https://issues.apache.org/jira/browse/HDFS-15239
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Currently, when using the HDFS explorer page, it is easy to go into a folder.
> However, to go back one has to use the browser back button (if one is coming 
> from that folder) or to edit the path by hand.
> It would be nice to have the typical button to go to the parent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15239) Add button to go to the parent directory in the explorer

2020-03-24 Thread Jira
Íñigo Goiri created HDFS-15239:
--

 Summary: Add button to go to the parent directory in the explorer
 Key: HDFS-15239
 URL: https://issues.apache.org/jira/browse/HDFS-15239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Íñigo Goiri


Currently, when using the HDFS explorer page, it is easy to go into a folder.
However, to go back one has to use the browser back button (if one is coming 
from that folder) or to edit the path by hand.
It would be nice to have the typical button to go to the parent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066342#comment-17066342
 ] 

Íñigo Goiri commented on HDFS-13470:


I was playing around with trunk and I realized that the explorer never made it 
into OSS.
I updated with the latest HDFS explorer and made it consistent with the NN.
Initially, I was trying to reuse the same files but I wasn't able to do a 
proper refactor.
Having duplicated files is not optimal, but I think at this point it is better than 
not having this.

> RBF: Add Browse the Filesystem button to the UI
> ---
>
> Key: HDFS-13470
> URL: https://issues.apache.org/jira/browse/HDFS-13470
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch
>
>
> After HDFS-12512 added WebHDFS, we can add the support to browse the 
> filesystem to the UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13470) RBF: Add Browse the Filesystem button to the UI

2020-03-24 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-13470:
---
Attachment: HDFS-13470.001.patch

> RBF: Add Browse the Filesystem button to the UI
> ---
>
> Key: HDFS-13470
> URL: https://issues.apache.org/jira/browse/HDFS-13470
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Attachments: HDFS-13470.000.patch, HDFS-13470.001.patch
>
>
> After HDFS-12512 added WebHDFS, we can add the support to browse the 
> filesystem to the UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066336#comment-17066336
 ] 

Hadoop QA commented on HDFS-15238:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
44s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
 3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 56s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 15s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  7m 
40s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 65m  4s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 |
| JIRA Issue | HDFS-15238 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12997614/HDFS-15238-002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux e20b0a71b713 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / d353b30 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29017/testReport/ |
| Max. process+thread count | 3593 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29017/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> RBF:NamenodeHeartbeatService

[jira] [Commented] (HDFS-15196) RBF: RouterRpcServer getListing cannot list large dirs correctly

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066315#comment-17066315
 ] 

Íñigo Goiri commented on HDFS-15196:


If possible, I would like to have the correct remaining entries value.
As you guys say, the client will just keep asking as long as the value is non-0.
However, this could be used for pagination or something else.
Even though I cannot point to an example that uses this, I would try to get 
the right value in each call.

> RBF: RouterRpcServer getListing cannot list large dirs correctly
> 
>
> Key: HDFS-15196
> URL: https://issues.apache.org/jira/browse/HDFS-15196
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Critical
> Attachments: HDFS-15196.001.patch, HDFS-15196.002.patch, 
> HDFS-15196.003.patch, HDFS-15196.003.patch, HDFS-15196.004.patch, 
> HDFS-15196.005.patch, HDFS-15196.006.patch, HDFS-15196.007.patch, 
> HDFS-15196.008.patch, HDFS-15196.009.patch, HDFS-15196.010.patch
>
>
> In RouterRpcServer, the getListing function is handled in two parts:
>  # Union all partial listings from the destination ns + paths
>  # Append the mount points for the dir to be listed
> In the case of a large dir that is bigger than DFSConfigKeys.DFS_LIST_LIMIT 
> (default value 1k), batch listing is used and startAfter 
> defines the boundary of each batch. However, step 2 
> here adds the existing mount points, which messes up the boundary of 
> the batch, thus making the next batch's startAfter wrong.
> The fix is just to append the mount points when no further batch query is 
> necessary.
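A minimal sketch of the fix idea (method and parameter names are assumptions, not the actual patch): mount points are appended only once there is no further batch to fetch, so the startAfter boundary of earlier batches stays intact.
{code}
// Hedged sketch: only the last batch (remainingEntries == 0) gets the mount
// points appended to the merged listing.
import java.util.ArrayList;
import java.util.List;

public class ListingMerge {
  static List<String> merge(List<String> partialListing, List<String> mountPoints,
      int remainingEntries) {
    List<String> merged = new ArrayList<>(partialListing);
    if (remainingEntries == 0) {
      merged.addAll(mountPoints);
    }
    return merged;
  }
}
{code}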



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-24 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066311#comment-17066311
 ] 

xuzq commented on HDFS-15238:
-

Thanks [~elgoiri], please review  [^HDFS-15238-002.patch] 

> RBF:NamenodeHeartbeatService caused memory to grow rapidly
> --
>
> Key: HDFS-15238
> URL: https://issues.apache.org/jira/browse/HDFS-15238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-15238-002.patch, HDFS-15238-trunk-001.patch
>
>
> NamenodeHeartbeatService gets the NameNode's HA status every 5s and creates a new 
> HAServiceProtocol every time.
> When creating the HAServiceProtocol, it also creates a new Configuration.
> Over time, more and more REGISTER entries accumulate in the Configuration until a 
> full GC happens.
> The entries then pile up again and, after reaching a certain threshold, another 
> full GC is triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-24 Thread xuzq (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuzq updated HDFS-15238:

Attachment: (was: HDFS-15238-trunk-002.patch)

> RBF:NamenodeHeartbeatService caused memory to grow rapidly
> --
>
> Key: HDFS-15238
> URL: https://issues.apache.org/jira/browse/HDFS-15238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-15238-002.patch, HDFS-15238-trunk-001.patch
>
>
> NamenodeHeartbeatService gets the NameNode's HA status every 5s and creates a new 
> HAServiceProtocol every time.
> When creating the HAServiceProtocol, it also creates a new Configuration.
> Over time, more and more REGISTER entries accumulate in the Configuration until a 
> full GC happens.
> The entries then pile up again and, after reaching a certain threshold, another 
> full GC is triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-24 Thread xuzq (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuzq updated HDFS-15238:

Attachment: HDFS-15238-002.patch

> RBF:NamenodeHeartbeatService caused memory to grow rapidly
> --
>
> Key: HDFS-15238
> URL: https://issues.apache.org/jira/browse/HDFS-15238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-15238-002.patch, HDFS-15238-trunk-001.patch
>
>
> NamenodeHeartbeatService gets the NameNode's HA status every 5s and creates a new 
> HAServiceProtocol every time.
> When creating the HAServiceProtocol, it also creates a new Configuration.
> Over time, more and more REGISTER entries accumulate in the Configuration until a 
> full GC happens.
> The entries then pile up again and, after reaching a certain threshold, another 
> full GC is triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-24 Thread xuzq (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuzq updated HDFS-15238:

Attachment: HDFS-15238-trunk-002.patch

> RBF:NamenodeHeartbeatService caused memory to grow rapidly
> --
>
> Key: HDFS-15238
> URL: https://issues.apache.org/jira/browse/HDFS-15238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-15238-trunk-001.patch, HDFS-15238-trunk-002.patch
>
>
> NamenodeHeartbeatService gets the NameNode's HA status every 5s and creates a new 
> HAServiceProtocol every time.
> When creating the HAServiceProtocol, it also creates a new Configuration.
> Over time, more and more REGISTER entries accumulate in the Configuration until a 
> full GC happens.
> The entries then pile up again and, after reaching a certain threshold, another 
> full GC is triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12733) Option to disable to namenode local edits

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-12733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066297#comment-17066297
 ] 

Íñigo Goiri commented on HDFS-12733:


I guess  [^HDFS-12733.008.patch] assumes that one would set 
DFS_NAMENODE_EDITS_DIR_KEY to null to disable the local edits?
I think that is fine; can we document this behavior in the HA QJM md file, mentioning 
that if we want to disable the local edits (assuming that the shared ones are 
enough), one needs to set dfs.namenode.edits.dir to blank?
In addition, in FSNameSystem let's add a log message when 
{{!noSharedEditDirs.isEmpty()}} saying that the local edits are disabled and 
that it will rely on the shared ones.
BTW, is there a need to check that there is at least one edits location 
specified? I think that may already be implicit.
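A minimal sketch of the suggested log message (variable names and wording are assumptions, not the final change):
{code}
// Hedged sketch: warn that local edits are disabled and that the NameNode will
// rely only on the shared edits directories.
if (localEditDirs.isEmpty() && !sharedEditDirs.isEmpty()) {
  LOG.info("dfs.namenode.edits.dir is empty; local edits are disabled and the"
      + " NameNode will rely on the shared edits directories: {}", sharedEditDirs);
}
{code}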


> Option to disable to namenode local edits
> -
>
> Key: HDFS-12733
> URL: https://issues.apache.org/jira/browse/HDFS-12733
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, performance
>Reporter: Brahma Reddy Battula
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-12733-001.patch, HDFS-12733-002.patch, 
> HDFS-12733-003.patch, HDFS-12733.004.patch, HDFS-12733.005.patch, 
> HDFS-12733.006.patch, HDFS-12733.007.patch, HDFS-12733.008.patch
>
>
> As of now, edits are written to both local and shared locations, which is 
> redundant since the local edits are never used in an HA setup.
> Disabling local edits gives a small performance improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066291#comment-17066291
 ] 

Íñigo Goiri commented on HDFS-15082:


[~hexiaoqiao], do you mind submitting again? It has been some time and I'd like 
to make sure it still runs.
Other than that, it looks good.

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch, HDFS-15082.002.patch
>
>
> When adding or updating a mount entry, a component of the destination path could 
> exceed the filesystem path component length limit, controlled by 
> `dfs.namenode.fs-limits.max-component-length` on the NameNode. So we should check 
> the length of each component of the destination path on the Router side when 
> adding or updating a mount entry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14385) RBF: Optimize MiniRouterDFSCluster with optional light weight MiniDFSCluster

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-14385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066288#comment-17066288
 ] 

Íñigo Goiri commented on HDFS-14385:


[~hexiaoqiao], do you mind rebasing to trunk and posting a patch for that?
HDFS-13891 has already been merged.

> RBF: Optimize MiniRouterDFSCluster with optional light weight MiniDFSCluster
> 
>
> Key: HDFS-14385
> URL: https://issues.apache.org/jira/browse/HDFS-14385
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-14385-HDFS-13891.001.patch
>
>
> MiniRouterDFSCluster mimics a federated HDFS cluster with routers to support RBF 
> tests. It starts a MiniDFSCluster with the complete set of 
> HDFS roles, which has a significant time cost. As discussed in HDFS-14351, it is 
> better to provide a mock MiniDFSCluster/Namenodes as an option to support some 
> test cases and reduce the time cost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13916) Distcp SnapshotDiff to support WebHDFS

2020-03-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066267#comment-17066267
 ] 

Hadoop QA commented on HDFS-13916:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  8s{color} 
| {color:red} HDFS-13916 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-13916 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12976595/HDFS-13916.006.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29016/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Distcp SnapshotDiff to support WebHDFS
> --
>
> Key: HDFS-13916
> URL: https://issues.apache.org/jira/browse/HDFS-13916
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: distcp, webhdfs
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Xun REN
>Assignee: Xun REN
>Priority: Major
>  Labels: easyfix, newbie, patch
> Attachments: HDFS-13916.002.patch, HDFS-13916.003.patch, 
> HDFS-13916.004.patch, HDFS-13916.005.patch, HDFS-13916.006.patch, 
> HDFS-13916.patch
>
>
> [~ljain] worked on HDFS-13052 to make it possible to run 
> DistCp with SnapshotDiff over WebHdfsFileSystem. However, in that patch, 
> there is no modification to the actual Java class that is used when launching 
> the command "hadoop distcp ..."
>  
> You can check the latest version here:
> [https://github.com/apache/hadoop/blob/branch-3.1.1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java#L96-L100]
> In the method "preSyncCheck" of the class "DistCpSync", we still check whether the 
> file system is DFS. 
> So I propose to change the class DistCpSync to take into 
> account what was committed by Lokesh Jain.
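A rough sketch of the proposed relaxation (not the committed code): the pre-sync check would accept WebHdfsFileSystem in addition to DistributedFileSystem.
{code}
// Hedged sketch: treat both DistributedFileSystem and WebHdfsFileSystem as
// snapshot-diff capable instead of requiring DistributedFileSystem only.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.web.WebHdfsFileSystem;

public class SnapshotDiffSupport {
  static boolean supportsSnapshotDiff(FileSystem fs) {
    return fs instanceof DistributedFileSystem || fs instanceof WebHdfsFileSystem;
  }
}
{code}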



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13916) Distcp SnapshotDiff to support WebHDFS

2020-03-24 Thread Siyao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDFS-13916:
--
Target Version/s: 3.3.1, 3.4.0

> Distcp SnapshotDiff to support WebHDFS
> --
>
> Key: HDFS-13916
> URL: https://issues.apache.org/jira/browse/HDFS-13916
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: distcp, webhdfs
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Xun REN
>Assignee: Xun REN
>Priority: Major
>  Labels: easyfix, newbie, patch
> Attachments: HDFS-13916.002.patch, HDFS-13916.003.patch, 
> HDFS-13916.004.patch, HDFS-13916.005.patch, HDFS-13916.006.patch, 
> HDFS-13916.patch
>
>
> [~ljain] worked on HDFS-13052 to make it possible to run 
> DistCp with SnapshotDiff over WebHdfsFileSystem. However, in that patch, 
> there is no modification to the actual Java class that is used when launching 
> the command "hadoop distcp ..."
>  
> You can check the latest version here:
> [https://github.com/apache/hadoop/blob/branch-3.1.1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java#L96-L100]
> In the method "preSyncCheck" of the class "DistCpSync", we still check whether the 
> file system is DFS. 
> So I propose to change the class DistCpSync to take into 
> account what was committed by Lokesh Jain.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14385) RBF: Optimize MiniRouterDFSCluster with optional light weight MiniDFSCluster

2020-03-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066252#comment-17066252
 ] 

Hadoop QA commented on HDFS-14385:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  7m 
52s{color} | {color:red} Docker failed to build yetus/hadoop:bdbca0e53b4. 
{color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-14385 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12964438/HDFS-14385-HDFS-13891.001.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29015/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> RBF: Optimize MiniRouterDFSCluster with optional light weight MiniDFSCluster
> 
>
> Key: HDFS-14385
> URL: https://issues.apache.org/jira/browse/HDFS-14385
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-14385-HDFS-13891.001.patch
>
>
> MiniRouterDFSCluster mimics a federated HDFS cluster with routers to support RBF 
> tests. It starts a MiniDFSCluster with the complete set of 
> HDFS roles, which has a significant time cost. As discussed in HDFS-14351, it is 
> better to provide a mock MiniDFSCluster/Namenodes as an option to support some 
> test cases and reduce the time cost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14385) RBF: Optimize MiniRouterDFSCluster with optional light weight MiniDFSCluster

2020-03-24 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066242#comment-17066242
 ] 

Wei-Chiu Chuang commented on HDFS-14385:


Triggering a rebuild: https://builds.apache.org/job/PreCommit-HDFS-Build/29014/

> RBF: Optimize MiniRouterDFSCluster with optional light weight MiniDFSCluster
> 
>
> Key: HDFS-14385
> URL: https://issues.apache.org/jira/browse/HDFS-14385
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-14385-HDFS-13891.001.patch
>
>
> MiniRouterDFSCluster mimics a federated HDFS cluster with routers to support RBF 
> tests. It starts a MiniDFSCluster with the complete set of 
> HDFS roles, which has a significant time cost. As discussed in HDFS-14351, it is 
> better to provide a mock MiniDFSCluster/Namenodes as an option to support some 
> test cases and reduce the time cost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15215) The Timestamp for longest write/read lock held log is wrong

2020-03-24 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066222#comment-17066222
 ] 

Hudson commented on HDFS-15215:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18083 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18083/])
HDFS-15215. The Timestamp for longest write/read lock held log is wrong 
(github: rev d353b30baf6da5b70685cf837cf7095636f345e1)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSNamesystemLock.java
* (edit) 
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/FakeTimer.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystemLock.java


> The Timestamp for longest write/read lock held log is wrong
> ---
>
> Key: HDFS-15215
> URL: https://issues.apache.org/jira/browse/HDFS-15215
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Major
> Fix For: 3.3.0
>
>
> I found the Timestamp for longest write/read lock held log is wrong in trunk:
> {code}
> 2020-03-10 16:01:26,585 [main] INFO  namenode.FSNamesystem 
> (FSNamesystemLock.java:writeUnlock(281)) - Number of suppressed 
> write-lock reports: 0
>   Longest write-lock held at 1970-01-03 07:07:40,841+0900 for 3ms via 
> java.lang.Thread.getStackTrace(Thread.java:1559)
> ...
> {code}
> Looking at the code, the timestamp comes from System.nanoTime(), 
> which returns the current value of the running Java Virtual Machine's 
> high-resolution time source; this method can only be used to measure 
> elapsed time:
> https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime--
> We need to derive the timestamp from System.currentTimeMillis() instead.
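A short illustration of the distinction (a sketch, not the patch itself):
{code}
// Hedged sketch: nanoTime() is only meaningful for elapsed-time measurements,
// while currentTimeMillis() gives a wall-clock timestamp that can be formatted
// as a date for the "longest write-lock held at ..." log line.
long acquiredAtMs = System.currentTimeMillis(); // wall clock, safe to format as a date
long startNanos = System.nanoTime();            // monotonic, only for durations
// ... hold the lock ...
long heldMs = (System.nanoTime() - startNanos) / 1_000_000L;
{code}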



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15154) Allow only hdfs superusers the ability to assign HDFS storage policies

2020-03-24 Thread Arpit Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066220#comment-17066220
 ] 

Arpit Agarwal commented on HDFS-15154:
--

Sorry I didn't get time to look at this in detail. [~ayushtkn] if the changes 
look good to you then please go ahead. Main thing would be to ensure there is 
no incompatibility introduced by the change.

> Allow only hdfs superusers the ability to assign HDFS storage policies
> --
>
> Key: HDFS-15154
> URL: https://issues.apache.org/jira/browse/HDFS-15154
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.0.0
>Reporter: Bob Cauthen
>Assignee: Siddharth Wagle
>Priority: Major
> Attachments: HDFS-15154.01.patch, HDFS-15154.02.patch, 
> HDFS-15154.03.patch, HDFS-15154.04.patch, HDFS-15154.05.patch, 
> HDFS-15154.06.patch, HDFS-15154.07.patch, HDFS-15154.08.patch, 
> HDFS-15154.09.patch, HDFS-15154.10.patch, HDFS-15154.11.patch, 
> HDFS-15154.12.patch, HDFS-15154.13.patch, HDFS-15154.14.patch
>
>
> Please provide a way to allow only HDFS superusers to assign HDFS 
> Storage Policies to HDFS directories.
> Currently, based on Jira HDFS-7093, all storage policies can be disabled 
> cluster wide by setting the following:
> dfs.storage.policy.enabled to false
> But we need a way to allow only HDFS superusers to assign an HDFS 
> Storage Policy to an HDFS directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15154) Allow only hdfs superusers the ability to assign HDFS storage policies

2020-03-24 Thread Arpit Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066220#comment-17066220
 ] 

Arpit Agarwal edited comment on HDFS-15154 at 3/24/20, 10:03 PM:
-

Sorry I didn't get time to look at this in detail. [~ayushtkn] if the changes 
look good to you then please go ahead and commit. Main thing would be to ensure 
there is no incompatibility introduced by the change.


was (Author: arpitagarwal):
Sorry I didn't get time to look at this in detail. [~ayushtkn] if the changes 
look good to you then please go ahead. Main thing would be to ensure there is 
no incompatibility introduced by the change.

> Allow only hdfs superusers the ability to assign HDFS storage policies
> --
>
> Key: HDFS-15154
> URL: https://issues.apache.org/jira/browse/HDFS-15154
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.0.0
>Reporter: Bob Cauthen
>Assignee: Siddharth Wagle
>Priority: Major
> Attachments: HDFS-15154.01.patch, HDFS-15154.02.patch, 
> HDFS-15154.03.patch, HDFS-15154.04.patch, HDFS-15154.05.patch, 
> HDFS-15154.06.patch, HDFS-15154.07.patch, HDFS-15154.08.patch, 
> HDFS-15154.09.patch, HDFS-15154.10.patch, HDFS-15154.11.patch, 
> HDFS-15154.12.patch, HDFS-15154.13.patch, HDFS-15154.14.patch
>
>
> Please provide a way to allow only HDFS superusers to assign HDFS 
> Storage Policies to HDFS directories.
> Currently, based on Jira HDFS-7093, all storage policies can be disabled 
> cluster wide by setting the following:
> dfs.storage.policy.enabled to false
> But we need a way to allow only HDFS superusers to assign an HDFS 
> Storage Policy to an HDFS directory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15215) The Timestamp for longest write/read lock held log is wrong

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066209#comment-17066209
 ] 

Íñigo Goiri commented on HDFS-15215:


Thanks [~brfrn169] for the fix.
Merged the PR to trunk.

> The Timestamp for longest write/read lock held log is wrong
> ---
>
> Key: HDFS-15215
> URL: https://issues.apache.org/jira/browse/HDFS-15215
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Major
> Fix For: 3.3.0
>
>
> I found the Timestamp for longest write/read lock held log is wrong in trunk:
> {code}
> 2020-03-10 16:01:26,585 [main] INFO  namenode.FSNamesystem 
> (FSNamesystemLock.java:writeUnlock(281)) - Number of suppressed 
> write-lock reports: 0
>   Longest write-lock held at 1970-01-03 07:07:40,841+0900 for 3ms via 
> java.lang.Thread.getStackTrace(Thread.java:1559)
> ...
> {code}
> Looking at the code, the timestamp comes from System.nanoTime(), 
> which returns the current value of the running Java Virtual Machine's 
> high-resolution time source; this method can only be used to measure 
> elapsed time:
> https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime--
> We need to derive the timestamp from System.currentTimeMillis() instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15215) The Timestamp for longest write/read lock held log is wrong

2020-03-24 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri resolved HDFS-15215.

Fix Version/s: 3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> The Timestamp for longest write/read lock held log is wrong
> ---
>
> Key: HDFS-15215
> URL: https://issues.apache.org/jira/browse/HDFS-15215
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Major
> Fix For: 3.3.0
>
>
> I found the Timestamp for longest write/read lock held log is wrong in trunk:
> {code}
> 2020-03-10 16:01:26,585 [main] INFO  namenode.FSNamesystem 
> (FSNamesystemLock.java:writeUnlock(281)) - Number of suppressed 
> write-lock reports: 0
>   Longest write-lock held at 1970-01-03 07:07:40,841+0900 for 3ms via 
> java.lang.Thread.getStackTrace(Thread.java:1559)
> ...
> {code}
> Looking at the code, the timestamp comes from System.nanoTime(), 
> which returns the current value of the running Java Virtual Machine's 
> high-resolution time source; this method can only be used to measure 
> elapsed time:
> https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime--
> We need to derive the timestamp from System.currentTimeMillis() instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066208#comment-17066208
 ] 

Íñigo Goiri commented on HDFS-15075:


Thanks [~weichiu] for the comment, I think it is fair.
[~hexiaoqiao] do you mind limiting this JIRA to fixing the metrics in 
BPServiceActor and creating a separate JIRA for the other metrics?

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch
>
>
> HDFS-14997 made the command processing asynchronous.
> Right now, we are measuring the time it takes to add to the queue.
> We should remove this and maybe move the timing into the processing thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066203#comment-17066203
 ] 

Íñigo Goiri commented on HDFS-15238:


BTW, I'm not sure if Yetus will catch this patch file name, let's call it 
HDFS-15238.002.patch just in case.

> RBF:NamenodeHeartbeatService caused memory to grow rapidly
> --
>
> Key: HDFS-15238
> URL: https://issues.apache.org/jira/browse/HDFS-15238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-15238-trunk-001.patch
>
>
> NamenodeHeartbeatService gets the NameNode's HA status every 5s and creates a new 
> HAServiceProtocol every time.
> When creating the HAServiceProtocol, it also creates a new Configuration.
> Over time, more and more REGISTER entries accumulate in the Configuration until a 
> full GC happens.
> The entries then pile up again and, after reaching a certain threshold, another 
> full GC is triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066202#comment-17066202
 ] 

Íñigo Goiri commented on HDFS-15238:


It makes sense to cache it; I'm guessing you have profiled this and the size is 
significant?
We may want to reset it on exceptions just in case.
So my comments would be:
* When we catch the exception, set localTargetHAProtocol to null.
* Add a javadoc comment to localTargetHAProtocol:
{code}
/** Cached HA protocol. */
private HAServiceProtocol localTargetHAProtocol;
{code}
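For illustration, a hedged sketch of how the two points above could fit together 
(the class and field wiring is assumed, not the actual NamenodeHeartbeatService; 
HAServiceTarget.getProxy and HAServiceProtocol.getServiceStatus are existing APIs):
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ha.HAServiceProtocol;
import org.apache.hadoop.ha.HAServiceStatus;
import org.apache.hadoop.ha.HAServiceTarget;

class HaStatusPoller {
  private final HAServiceTarget localTarget;
  private final Configuration conf;
  private final int rpcTimeoutMs;

  /** Cached HA protocol, rebuilt only after a failure. */
  private HAServiceProtocol localTargetHAProtocol;

  HaStatusPoller(HAServiceTarget target, Configuration conf, int rpcTimeoutMs) {
    this.localTarget = target;
    this.conf = conf;
    this.rpcTimeoutMs = rpcTimeoutMs;
  }

  HAServiceStatus pollHAStatus() throws IOException {
    try {
      if (localTargetHAProtocol == null) {
        // Per the issue description, creating the proxy also creates a new
        // Configuration, which is what piles up when done on every 5s poll.
        localTargetHAProtocol = localTarget.getProxy(conf, rpcTimeoutMs);
      }
      return localTargetHAProtocol.getServiceStatus();
    } catch (IOException e) {
      // As suggested above: drop the cached proxy on failure so a broken
      // connection is not reused on the next heartbeat.
      localTargetHAProtocol = null;
      throw e;
    }
  }
}
{code}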

> RBF:NamenodeHeartbeatService caused memory to grow rapidly
> --
>
> Key: HDFS-15238
> URL: https://issues.apache.org/jira/browse/HDFS-15238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-15238-trunk-001.patch
>
>
> NamenodeHeartbeatService gets the NameNode's HA status every 5s and creates a new 
> HAServiceProtocol every time.
> When creating the HAServiceProtocol, it also creates a new Configuration.
> Over time, there are more and more entries for REGISTER in Configuration 
> until a full GC happens. 
> The entries then pile up again, and after reaching a certain threshold the 
> full GC is triggered again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066165#comment-17066165
 ] 

Hadoop QA commented on HDFS-15191:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
45s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
11s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  6s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
9s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
10s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 10s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
52s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}104m 
21s{color} | {color:green} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}184m 10s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 |
| JIRA Issue | HDFS-15191 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12997584/HDFS-15191.004.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 24ea7c8c3a8e 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 4454c6d |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29013/testReport/ |
| Max. process+thread count | 2803 (vs. ulimit of 5500) |
| modu

[jira] [Commented] (HDFS-15219) DFS Client will stuck when ResponseProcessor.run throw Error

2020-03-24 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066072#comment-17066072
 ] 

Hudson commented on HDFS-15219:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18082 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18082/])
HDFS-15219. DFS Client will stuck when ResponseProcessor.run throw Error 
(github: rev d9c4f1129c0814ab61fce6ea8baf4b272f84c252)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java


> DFS Client will stuck when ResponseProcessor.run throw Error
> 
>
> Key: HDFS-15219
> URL: https://issues.apache.org/jira/browse/HDFS-15219
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.7.3
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Major
> Fix For: 3.3.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> In my case, a Tez application was stuck for more than 2 hours until we killed the 
> application. The reason is that a task attempt was stuck because speculative 
> execution is disabled. 
> The exception looks like this:
> {code:java}
> 2020-03-11 01:23:59,141 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records 
> read - 10
> 2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.FileSinkOperator|: FS[3]: 
> records written - 100
> 2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records 
> read - 100
> 2020-03-11 01:29:02,967 [FATAL] [ResponseProcessor for block 
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] 
> |yarn.YarnUncaughtExceptionHandler|: Thread Thread[ResponseProcessor for 
> block 
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073,5,main] 
> threw an Error. Shutting down now...
> java.lang.NoClassDefFoundError: com/google/protobuf/TextFormat
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.toString(PipelineAck.java:253)
>  at java.lang.String.valueOf(String.java:2847)
>  at java.lang.StringBuilder.append(StringBuilder.java:128)
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:737)
> Caused by: java.lang.ClassNotFoundException: com.google.protobuf.TextFormat
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>  ... 4 more
> Caused by: java.util.zip.ZipException: error reading zip file
>  at java.util.zip.ZipFile.read(Native Method)
>  at java.util.zip.ZipFile.access$1400(ZipFile.java:56)
>  at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:679)
>  at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:415)
>  at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
>  at sun.misc.Resource.getBytes(Resource.java:124)
>  at java.net.URLClassLoader.defineClass(URLClassLoader.java:444)
>  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>  ... 10 more
> 2020-03-11 01:29:02,970 [INFO] [ResponseProcessor for block 
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] 
> |util.ExitUtil|: Exiting with status -1
> 2020-03-11 03:27:26,833 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: 
> Received should die response from AM
> 2020-03-11 03:27:26,834 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: 
> Asked to die via task heartbeat
> 2020-03-11 03:27:26,839 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: 
> Attempting to abort attempt_1583335296048_917815_3_01_000704_0 due to an 
> invocation of shutdownRequested
> {code}
> The reason is an uncaught exception. At 01:29 a disk error occurred, which threw a 
> NoClassDefFoundError. ResponseProcessor.run only catches Exception, so it cannot catch 
> NoClassDefFoundError. As a result, the ResponseProcessor did not set errorState. Then 
> DataStreamer did not know the ResponseProcessor was dead and could not trigger 
> closeResponder, so it got stuck in DataStreamer.run.
>  I tested this in the unit test TestDataStream.testDfsClient. When I throw a 
> NoClassDefFoundError in ResponseProcessor.run, 
> TestDataStream.testDfsClient fails because of a timeout.
> I think we should catch Throwable instead of Exception in ResponseProcessor.run.
>  
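A simplified sketch of that suggestion (hypothetical names, not the actual 
DataStreamer/ResponseProcessor code): catching Throwable lets an Error such as 
NoClassDefFoundError still set the error state, so the writing thread can notice 
the failure and close the responder instead of hanging:
{code:java}
class ResponseWorker implements Runnable {
  private volatile boolean responderClosed = false;
  private volatile Throwable lastError;

  @Override
  public void run() {
    while (!responderClosed) {
      try {
        processOneAck();
      } catch (Throwable t) {          // Throwable, not Exception
        lastError = t;                 // record the error state
        responderClosed = true;        // let the streamer observe the failure
      }
    }
  }

  private void processOneAck() {
    // placeholder for reading and handling one pipeline ack
  }

  boolean hasError() {
    return lastError != null;
  }
}
{code}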



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.o

[jira] [Commented] (HDFS-15235) Transient network failure during NameNode failover makes cluster unavailable

2020-03-24 Thread YCozy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066043#comment-17066043
 ] 

YCozy commented on HDFS-15235:
--

[~ayushtkn] Thanks for looking at this! I'll try to upload a UT and a fix.

> Transient network failure during NameNode failover makes cluster unavailable
> 
>
> Key: HDFS-15235
> URL: https://issues.apache.org/jira/browse/HDFS-15235
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: YCozy
>Priority: Major
>
> We have an HA cluster with two NameNodes: an active NN1 and a standby NN2. At 
> some point, NN1 becomes unhealthy and the admin tries to manually fail over to 
> NN2 by running the command
> {code:java}
> $ hdfs haadmin -failover NN1 NN2
> {code}
> NN2 receives the request and becomes active:
> {code:java}
> 2020-03-24 00:24:56,412 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services 
> started for standby state
> 2020-03-24 00:24:56,413 WARN 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer 
> interrupted: sleep interrupted
> 2020-03-24 00:24:56,415 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services 
> required for active state
> 2020-03-24 00:24:56,417 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in /app/ha-name-dir-shared/current
> 2020-03-24 00:24:56,419 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in /app/nn2/name/current
> 2020-03-24 00:24:56,419 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest 
> edits from old active before taking over writer role in edits logs
> 2020-03-24 00:24:56,435 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Reading 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@7c3095fa 
> expecting start txid #1
> 2020-03-24 00:24:56,436 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file 
> /app/ha-name-dir-shared/current/edits_001-019 
> maxTxnsToRead = 9223372036854775807
> 2020-03-24 00:24:56,441 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 
> '/app/ha-name-dir-shared/current/edits_001-019'
>  to transaction ID 1
> 2020-03-24 00:24:56,567 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named 
> /app/ha-name-dir-shared/current/edits_001-019)
>  of total size 1305.0, total edits 19.0, total load time 109.0 ms
> 2020-03-24 00:24:56,567 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all 
> datanodes as stale
> 2020-03-24 00:24:56,568 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Processing 4 
> messages from DataNodes that were previously queued during standby state
> 2020-03-24 00:24:56,569 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication 
> and invalidation queues
> 2020-03-24 00:24:56,569 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: initializing 
> replication queues
> 2020-03-24 00:24:56,570 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing 
> edit logs at txnid 20
> 2020-03-24 00:24:56,571 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 20
> 2020-03-24 00:24:56,812 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Initializing quota with 4 
> thread(s)
> 2020-03-24 00:24:56,819 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Quota initialization 
> completed in 6 millisecondsname space=3storage space=24690storage 
> types=RAM_DISK=0, SSD=0, DISK=0, ARCHIVE=0, PROVIDED=0
> 2020-03-24 00:24:56,827 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: 
> Starting CacheReplicationMonitor with interval 3 milliseconds
> {code}
> But NN2 fails to send back the RPC response because of temporary network 
> partitioning.
> {code:java}
> java.io.EOFException: End of File Exception between local host is: 
> "24e7b5a52e85/172.17.0.2"; destination host is: "127.0.0.3":8180; : 
> java.io.EOFException; For more details see:  
> http://wiki.apache.org/hadoop/EOFException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop

[jira] [Commented] (HDFS-15235) Transient network failure during NameNode failover makes cluster unavailable

2020-03-24 Thread YCozy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066042#comment-17066042
 ] 

YCozy commented on HDFS-15235:
--

A bit more info:

After NN2 fails to send back a response, haadmin first tries to fail back to 
NN1 before it fences NN2. The fail-back succeeds, i.e., NN2 goes back to 
standby and NN1 goes back to active. Here's a log snippet from NN2 (the 
following log is from a different run, so please ignore the timestamps):

 
{code:java}
2020-03-24 17:17:27,254 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started 
for active state
2020-03-24 17:17:27,255 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
Ending log segment 19, 19
2020-03-24 17:17:27,255 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: LazyPersistFileScrubber 
was interrupted, exiting
2020-03-24 17:17:27,256 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: NameNodeEditLogRoller was 
interrupted, exiting
2020-03-24 17:17:27,257 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
Number of transactions: 2 Total time for transactions(ms): 5 Number of 
transactions batched in Syncs: 18 Number of syncs: 3 SyncTimes(ms): 4 6 
2020-03-24 17:17:27,259 INFO 
org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits 
file /app/ha-name-dir-shared/current/edits_inprogress_019 -> 
/app/ha-name-dir-shared/current/edits_019-020
2020-03-24 17:17:27,285 INFO 
org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits 
file /app/nn2/name/current/edits_inprogress_019 -> 
/app/nn2/name/current/edits_019-020
2020-03-24 17:17:27,286 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
FSEditLogAsync was interrupted, exiting
2020-03-24 17:17:27,290 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Shutting 
down CacheReplicationMonitor
2020-03-24 17:17:27,290 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required 
for standby state
2020-03-24 17:17:27,294 INFO 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on 
active node every 120 seconds.
2020-03-24 17:17:27,299 INFO 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby 
checkpoint thread...
{code}
 

Since we only want to make sure that "only one NameNode be in the Active state 
at any given time" ([from the description of the fencing 
config|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details]),
 and the successful fail-back has already achieved it, we shouldn't kill NN2.
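
As a rough illustration of the ordering argued for here (hypothetical pseudocode; 
NameNodeHandle, becomeActive, becomeStandby and fence are placeholders, not Hadoop 
APIs, and this is not the actual FailoverController implementation):
{code:java}
import java.io.IOException;

interface NameNodeHandle {
  void becomeActive() throws IOException;
  void becomeStandby() throws IOException;
  void fence();
}

final class ManualFailover {
  /** Fail over from nn1 to nn2; fence nn2 only if a graceful fail-back also fails. */
  static void failover(NameNodeHandle nn1, NameNodeHandle nn2) throws IOException {
    try {
      nn2.becomeActive();                  // NN2 may actually succeed even if
    } catch (IOException rpcFailure) {     // the RPC response back to haadmin is lost
      try {
        // Step 1: graceful fail-back, as observed in the log above.
        nn2.becomeStandby();
        nn1.becomeActive();
        return;                            // single active restored; no fencing needed
      } catch (IOException failbackFailure) {
        // Step 2: only now is NN2's state truly unknown, so fencing is justified.
        nn2.fence();
        nn1.becomeActive();
      }
    }
  }
}
{code}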

> Transient network failure during NameNode failover makes cluster unavailable
> 
>
> Key: HDFS-15235
> URL: https://issues.apache.org/jira/browse/HDFS-15235
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: YCozy
>Priority: Major
>
> We have an HA cluster with two NameNodes: an active NN1 and a standby NN2. At 
> some point, NN1 becomes unhealthy and the admin tries to manually fail over to 
> NN2 by running the command
> {code:java}
> $ hdfs haadmin -failover NN1 NN2
> {code}
> NN2 receives the request and becomes active:
> {code:java}
> 2020-03-24 00:24:56,412 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services 
> started for standby state
> 2020-03-24 00:24:56,413 WARN 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer 
> interrupted: sleep interrupted
> 2020-03-24 00:24:56,415 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services 
> required for active state
> 2020-03-24 00:24:56,417 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in /app/ha-name-dir-shared/current
> 2020-03-24 00:24:56,419 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in /app/nn2/name/current
> 2020-03-24 00:24:56,419 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest 
> edits from old active before taking over writer role in edits logs
> 2020-03-24 00:24:56,435 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Reading 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@7c3095fa 
> expecting start txid #1
> 2020-03-24 00:24:56,436 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file 
> /app/ha-name-dir-shared/current/edits_001-019 
> maxTxnsToRead = 9223372036854775807
> 2020-03-24 00:24:56,441 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding s

[jira] [Commented] (HDFS-13377) The owner of folder can set quota for his sub folder

2020-03-24 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066039#comment-17066039
 ] 

Hudson commented on HDFS-13377:
---

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #18081 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18081/])
HDFS-13377. The owner of folder can set quota for his sub folder. (ayushsaxena: 
rev ea87d6049340d1df040047aa08ce7784c03dd69e)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirAttrOp.java
* (add) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestQuotaAllowOwner.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestQuota.java


> The owner of folder can set quota for his sub folder
> 
>
> Key: HDFS-13377
> URL: https://issues.apache.org/jira/browse/HDFS-13377
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: HDFS-13377.003.patch, HDFS-13377.004.patch, 
> HDFS-13377.005.patch, HDFS-13377.006.patch, HDFS-13377.patch, 
> HDFS-13377.patch, HDFS-13377.patch
>
>
> Currently, only the super user can set quota. That is a huge burden for 
> administrators in a large system. Add a new feature to let the owner of a 
> folder also have the privilege to set quota for his sub folders. 
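For illustration, a minimal sketch of the permission rule being added (hypothetical 
method and parameter names expressed over plain strings, not the committed 
FSDirAttrOp code or its actual configuration key):
{code:java}
final class QuotaPermission {
  /**
   * The caller may set quota if they are the superuser, or, when the new
   * owner-quota feature is switched on, if they own the target folder.
   */
  static boolean canSetQuota(String caller, String superUser, String dirOwner,
                             boolean ownerQuotaAllowed) {
    if (caller.equals(superUser)) {
      return true;                      // existing behaviour: superuser always may
    }
    // New behaviour, guarded by a config switch: the folder owner may set
    // quota on his own sub folders.
    return ownerQuotaAllowed && caller.equals(dirOwner);
  }
}
{code}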



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15219) DFS Client will stuck when ResponseProcessor.run throw Error

2020-03-24 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15219:

Fix Version/s: 3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> DFS Client will stuck when ResponseProcessor.run throw Error
> 
>
> Key: HDFS-15219
> URL: https://issues.apache.org/jira/browse/HDFS-15219
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.7.3
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Major
> Fix For: 3.3.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> In my case, a Tez application was stuck for more than 2 hours until we killed the 
> application. The reason is that a task attempt was stuck because speculative 
> execution is disabled. 
> The exception looks like this:
> {code:java}
> 2020-03-11 01:23:59,141 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records 
> read - 10
> 2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.FileSinkOperator|: FS[3]: 
> records written - 100
> 2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records 
> read - 100
> 2020-03-11 01:29:02,967 [FATAL] [ResponseProcessor for block 
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] 
> |yarn.YarnUncaughtExceptionHandler|: Thread Thread[ResponseProcessor for 
> block 
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073,5,main] 
> threw an Error. Shutting down now...
> java.lang.NoClassDefFoundError: com/google/protobuf/TextFormat
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.toString(PipelineAck.java:253)
>  at java.lang.String.valueOf(String.java:2847)
>  at java.lang.StringBuilder.append(StringBuilder.java:128)
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:737)
> Caused by: java.lang.ClassNotFoundException: com.google.protobuf.TextFormat
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>  ... 4 more
> Caused by: java.util.zip.ZipException: error reading zip file
>  at java.util.zip.ZipFile.read(Native Method)
>  at java.util.zip.ZipFile.access$1400(ZipFile.java:56)
>  at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:679)
>  at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:415)
>  at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
>  at sun.misc.Resource.getBytes(Resource.java:124)
>  at java.net.URLClassLoader.defineClass(URLClassLoader.java:444)
>  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>  ... 10 more
> 2020-03-11 01:29:02,970 [INFO] [ResponseProcessor for block 
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] 
> |util.ExitUtil|: Exiting with status -1
> 2020-03-11 03:27:26,833 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: 
> Received should die response from AM
> 2020-03-11 03:27:26,834 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: 
> Asked to die via task heartbeat
> 2020-03-11 03:27:26,839 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: 
> Attempting to abort attempt_1583335296048_917815_3_01_000704_0 due to an 
> invocation of shutdownRequested
> {code}
> The reason is an uncaught exception. At 01:29 a disk error occurred, which threw a 
> NoClassDefFoundError. ResponseProcessor.run only catches Exception, so it cannot catch 
> NoClassDefFoundError. As a result, the ResponseProcessor did not set errorState. Then 
> DataStreamer did not know the ResponseProcessor was dead and could not trigger 
> closeResponder, so it got stuck in DataStreamer.run.
>  I tested this in the unit test TestDataStream.testDfsClient. When I throw a 
> NoClassDefFoundError in ResponseProcessor.run, 
> TestDataStream.testDfsClient fails because of a timeout.
> I think we should catch Throwable instead of Exception in ResponseProcessor.run.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15219) DFS Client will stuck when ResponseProcessor.run throw Error

2020-03-24 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066029#comment-17066029
 ] 

Ayush Saxena commented on HDFS-15219:
-

Merged PR. Thanx [~zhengchenyu]  for the contribution.

> DFS Client will stuck when ResponseProcessor.run throw Error
> 
>
> Key: HDFS-15219
> URL: https://issues.apache.org/jira/browse/HDFS-15219
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.7.3
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Major
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> In my case, a Tez application was stuck for more than 2 hours until we killed the 
> application. The reason is that a task attempt was stuck because speculative 
> execution is disabled. 
> The exception looks like this:
> {code:java}
> 2020-03-11 01:23:59,141 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records 
> read - 10
> 2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.FileSinkOperator|: FS[3]: 
> records written - 100
> 2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records 
> read - 100
> 2020-03-11 01:29:02,967 [FATAL] [ResponseProcessor for block 
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] 
> |yarn.YarnUncaughtExceptionHandler|: Thread Thread[ResponseProcessor for 
> block 
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073,5,main] 
> threw an Error. Shutting down now...
> java.lang.NoClassDefFoundError: com/google/protobuf/TextFormat
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.toString(PipelineAck.java:253)
>  at java.lang.String.valueOf(String.java:2847)
>  at java.lang.StringBuilder.append(StringBuilder.java:128)
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:737)
> Caused by: java.lang.ClassNotFoundException: com.google.protobuf.TextFormat
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>  ... 4 more
> Caused by: java.util.zip.ZipException: error reading zip file
>  at java.util.zip.ZipFile.read(Native Method)
>  at java.util.zip.ZipFile.access$1400(ZipFile.java:56)
>  at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:679)
>  at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:415)
>  at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
>  at sun.misc.Resource.getBytes(Resource.java:124)
>  at java.net.URLClassLoader.defineClass(URLClassLoader.java:444)
>  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>  ... 10 more
> 2020-03-11 01:29:02,970 [INFO] [ResponseProcessor for block 
> BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] 
> |util.ExitUtil|: Exiting with status -1
> 2020-03-11 03:27:26,833 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: 
> Received should die response from AM
> 2020-03-11 03:27:26,834 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: 
> Asked to die via task heartbeat
> 2020-03-11 03:27:26,839 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: 
> Attempting to abort attempt_1583335296048_917815_3_01_000704_0 due to an 
> invocation of shutdownRequested
> {code}
> The reason is an uncaught exception. At 01:29 a disk error occurred, which threw a 
> NoClassDefFoundError. ResponseProcessor.run only catches Exception, so it cannot catch 
> NoClassDefFoundError. As a result, the ResponseProcessor did not set errorState. Then 
> DataStreamer did not know the ResponseProcessor was dead and could not trigger 
> closeResponder, so it got stuck in DataStreamer.run.
>  I tested this in the unit test TestDataStream.testDfsClient. When I throw a 
> NoClassDefFoundError in ResponseProcessor.run, 
> TestDataStream.testDfsClient fails because of a timeout.
> I think we should catch Throwable instead of Exception in ResponseProcessor.run.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15235) Transient network failure during NameNode failover makes cluster unavailable

2020-03-24 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066019#comment-17066019
 ] 

Ayush Saxena commented on HDFS-15235:
-

Is there a UT to reproduce this, or do you have a fix to contribute?

> Transient network failure during NameNode failover makes cluster unavailable
> 
>
> Key: HDFS-15235
> URL: https://issues.apache.org/jira/browse/HDFS-15235
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: YCozy
>Priority: Major
>
> We have an HA cluster with two NameNodes: an active NN1 and a standby NN2. At 
> some point, NN1 becomes unhealthy and the admin tries to manually fail over to 
> NN2 by running the command
> {code:java}
> $ hdfs haadmin -failover NN1 NN2
> {code}
> NN2 receives the request and becomes active:
> {code:java}
> 2020-03-24 00:24:56,412 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services 
> started for standby state
> 2020-03-24 00:24:56,413 WARN 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer 
> interrupted: sleep interrupted
> 2020-03-24 00:24:56,415 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services 
> required for active state
> 2020-03-24 00:24:56,417 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in /app/ha-name-dir-shared/current
> 2020-03-24 00:24:56,419 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
> unfinalized segments in /app/nn2/name/current
> 2020-03-24 00:24:56,419 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest 
> edits from old active before taking over writer role in edits logs
> 2020-03-24 00:24:56,435 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Reading 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@7c3095fa 
> expecting start txid #1
> 2020-03-24 00:24:56,436 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file 
> /app/ha-name-dir-shared/current/edits_001-019 
> maxTxnsToRead = 9223372036854775807
> 2020-03-24 00:24:56,441 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 
> '/app/ha-name-dir-shared/current/edits_001-019'
>  to transaction ID 1
> 2020-03-24 00:24:56,567 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named 
> /app/ha-name-dir-shared/current/edits_001-019)
>  of total size 1305.0, total edits 19.0, total load time 109.0 ms
> 2020-03-24 00:24:56,567 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all 
> datanodes as stale
> 2020-03-24 00:24:56,568 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Processing 4 
> messages from DataNodes that were previously queued during standby state
> 2020-03-24 00:24:56,569 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication 
> and invalidation queues
> 2020-03-24 00:24:56,569 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: initializing 
> replication queues
> 2020-03-24 00:24:56,570 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing 
> edit logs at txnid 20
> 2020-03-24 00:24:56,571 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 20
> 2020-03-24 00:24:56,812 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Initializing quota with 4 
> thread(s)
> 2020-03-24 00:24:56,819 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory: Quota initialization 
> completed in 6 millisecondsname space=3storage space=24690storage 
> types=RAM_DISK=0, SSD=0, DISK=0, ARCHIVE=0, PROVIDED=0
> 2020-03-24 00:24:56,827 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: 
> Starting CacheReplicationMonitor with interval 3 milliseconds
> {code}
> But NN2 fails to send back the RPC response because of temporary network 
> partitioning.
> {code:java}
> java.io.EOFException: End of File Exception between local host is: 
> "24e7b5a52e85/172.17.0.2"; destination host is: "127.0.0.3":8180; : 
> java.io.EOFException; For more details see:  
> http://wiki.apache.org/hadoop/EOFException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.ha

[jira] [Updated] (HDFS-13377) The owner of folder can set quota for his sub folder

2020-03-24 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-13377:

Fix Version/s: 3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> The owner of folder can set quota for his sub folder
> 
>
> Key: HDFS-13377
> URL: https://issues.apache.org/jira/browse/HDFS-13377
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: HDFS-13377.003.patch, HDFS-13377.004.patch, 
> HDFS-13377.005.patch, HDFS-13377.006.patch, HDFS-13377.patch, 
> HDFS-13377.patch, HDFS-13377.patch
>
>
> Currently, only the super user can set quota. That is a huge burden for 
> administrators in a large system. Add a new feature to let the owner of a 
> folder also have the privilege to set quota for his sub folders. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13377) The owner of folder can set quota for his sub folder

2020-03-24 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066012#comment-17066012
 ] 

Ayush Saxena commented on HDFS-13377:
-

+1 for v006,

Committed to trunk. Thanx [~hadoop_yangyun] for the contribution and [~elgoiri] 
for the review!!!

> The owner of folder can set quota for his sub folder
> 
>
> Key: HDFS-13377
> URL: https://issues.apache.org/jira/browse/HDFS-13377
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Attachments: HDFS-13377.003.patch, HDFS-13377.004.patch, 
> HDFS-13377.005.patch, HDFS-13377.006.patch, HDFS-13377.patch, 
> HDFS-13377.patch, HDFS-13377.patch
>
>
> Currently, only the super user can set quota. That is a huge burden for 
> administrators in a large system. Add a new feature to let the owner of a 
> folder also have the privilege to set quota for his sub folders. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-24 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated HDFS-15191:
---
Status: Patch Available  (was: Open)

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch, HDFS-15191.004.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.
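For illustration, a hedged sketch (not the BlockTokenIdentifier code; requires 
hadoop-common on the classpath) of how a legacy Writable-style read hits 
EOFException when the serialized bytes are shorter than the fields the reader 
expects, which is the failure mode in the stack trace above:
{code:java}
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

public class LegacyReadEofDemo {
  public static void main(String[] args) throws IOException {
    // Pretend this is a token identifier written with fewer/shorter fields
    // than the reader expects (e.g. written by a different Hadoop version).
    byte[] shortBuffer = new byte[] { 0x02 };   // one vint, nothing after it

    DataInputStream in = new DataInputStream(new ByteArrayInputStream(shortBuffer));
    long first = WritableUtils.readVLong(in);   // succeeds: reads the single byte
    System.out.println("first field = " + first);

    try {
      WritableUtils.readVInt(in);               // second field is missing
    } catch (EOFException e) {
      System.out.println("EOF reading the next legacy field: " + e);
    }
  }
}
{code}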



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-24 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated HDFS-15191:
---
Attachment: HDFS-15191.004.patch

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch, HDFS-15191.004.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-24 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated HDFS-15191:
---
Status: Open  (was: Patch Available)

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065944#comment-17065944
 ] 

Hadoop QA commented on HDFS-15191:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 30m 
46s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
32s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  8s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
7s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
10s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
14s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 48s{color} | {color:orange} hadoop-hdfs-project: The patch generated 9 new + 
20 unchanged - 0 fixed = 29 total (was 20) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 24s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
53s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}102m 16s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}214m 51s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestBPOfferService |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:4454c6d14b7 |
| JIRA Issue | HDFS-15191 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12997550/HDFS-15191.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux ecfe56bb171a 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 4454c6d |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs

[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2020-03-24 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065533#comment-17065533
 ] 

Hadoop QA commented on HDFS-13243:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  8s{color} 
| {color:red} HDFS-13243 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-13243 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12918525/HDFS-13243-v6.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/29010/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, 
> HDFS-13243-v6.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced 
> by calling close and sync at the same time.
> When the close call is not successful, the UC block's status changes to 
> COMMITTED, and if a sync request is then popped from the queue and processed, 
> the sync operation changes the last block's length.
> After that, the DataNode reports all received blocks to the NameNode, which 
> checks the block length of all COMMITTED blocks. But the block length recorded 
> in NameNode memory already differs from the one reported by the DataNode, and 
> consequently the last block is marked as corrupted because of the inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_108

[jira] [Updated] (HDFS-11310) Reduce the performance impact of the balancer (trunk port)

2020-03-24 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-11310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HDFS-11310:
-
Target Version/s: 3.4.0  (was: 3.3.0)

> Reduce the performance impact of the balancer (trunk port)
> --
>
> Key: HDFS-11310
> URL: https://issues.apache.org/jira/browse/HDFS-11310
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.0.0-alpha1
>Reporter: Daryn Sharp
>Priority: Critical
>
> HDFS-7967 introduced a highly performant balancer getBlocks() query that 
> scales to large/dense clusters.  The simple design implementation depends on 
> the triplets data structure.  HDFS-9260 removed the triplets which 
> fundamentally changes the implementation.  Either that patch must be reverted 
> or the getBlocks() patch needs reimplementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11310) Reduce the performance impact of the balancer (trunk port)

2020-03-24 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-11310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HDFS-11310:
-

As the code freeze for 3.3 has passed, moving this Jira to 3.4. Thank you.

> Reduce the performance impact of the balancer (trunk port)
> --
>
> Key: HDFS-11310
> URL: https://issues.apache.org/jira/browse/HDFS-11310
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.0.0-alpha1
>Reporter: Daryn Sharp
>Priority: Critical
>
> HDFS-7967 introduced a highly performant balancer getBlocks() query that 
> scales to large/dense clusters.  The simple design implementation depends on 
> the triplets data structure.  HDFS-9260 removed the triplets which 
> fundamentally changes the implementation.  Either that patch must be reverted 
> or the getBlocks() patch needs reimplementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15234) Add a default method body for the INodeAttributeProvider#checkPermissionWithContext API

2020-03-24 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HDFS-15234:
-
Status: Patch Available  (was: Open)

> Add a default method body for the 
> INodeAttributeProvider#checkPermissionWithContext API
> ---
>
> Key: HDFS-15234
> URL: https://issues.apache.org/jira/browse/HDFS-15234
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Blocker
>
> The new API INodeAttributeProvider#checkPermissionWithContext() needs a 
> default method body. Otherwise old implementations fail to compile.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-24 Thread xuzq (Jira)
xuzq created HDFS-15238:
---

 Summary: RBF:NamenodeHeartbeatService caused memory to grow rapidly
 Key: HDFS-15238
 URL: https://issues.apache.org/jira/browse/HDFS-15238
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: xuzq
Assignee: xuzq


NamenodeHeartbeatService gets the NameNode's HA status every 5s and creates a 
new HAServiceProtocol each time.

Creating the HAServiceProtocol also creates a new Configuration.
Over time, more and more REGISTER entries accumulate in the Configuration until 
a full GC happens.
The entries then pile up again, and after reaching a certain threshold the full 
GC is triggered again.
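
One way to avoid this growth would be to create the Configuration and the 
HAServiceProtocol proxy once and reuse them across polls. A minimal sketch of 
that idea follows; the class name, field names, and the createHAProtocolProxy() 
helper are hypothetical stand-ins, not the actual NamenodeHeartbeatService code.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ha.HAServiceProtocol;

// Hypothetical sketch: cache the Configuration and the HA proxy instead of
// recreating them on every 5s poll, so REGISTER entries do not accumulate.
public class CachedHAStatusPoller {
  private final Configuration conf = new Configuration();  // created once, not per poll
  private final InetSocketAddress rpcAddress;
  private HAServiceProtocol cachedProxy;                    // reused across polls

  public CachedHAStatusPoller(InetSocketAddress rpcAddress) {
    this.rpcAddress = rpcAddress;
  }

  public HAServiceProtocol.HAServiceState pollState() throws IOException {
    if (cachedProxy == null) {
      cachedProxy = createHAProtocolProxy(conf, rpcAddress); // hypothetical helper
    }
    return cachedProxy.getServiceStatus().getState();
  }

  private HAServiceProtocol createHAProtocolProxy(Configuration conf,
      InetSocketAddress addr) throws IOException {
    // Placeholder for whatever RPC proxy creation the service actually uses.
    throw new UnsupportedOperationException("illustrative stub");
  }
}
{code}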



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-8893) DNs with failed volumes stop serving during rolling upgrade

2020-03-24 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HDFS-8893:

Target Version/s: 3.4.0  (was: 3.3.0)

Moved to 3.4.0.

> DNs with failed volumes stop serving during rolling upgrade
> ---
>
> Key: HDFS-8893
> URL: https://issues.apache.org/jira/browse/HDFS-8893
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Rushabh Shah
>Priority: Critical
>
> When a rolling upgrade starts, all DNs try to write a rolling_upgrade marker 
> to each of their volumes. If one of the volumes is bad, this will fail. When 
> this failure happens, the DN does not update the key it received from the NN.
> Unfortunately we had one failed volume on all 3 of the datanodes that were 
> holding the replicas.
> Keys expire after 20 hours, so at about 20 hours into the rolling upgrade, the 
> DNs with failed volumes will stop serving clients.
> Here is the stack trace on the datanode side:
> {noformat}
> 2015-08-11 07:32:28,827 [DataNode: heartbeating to 8020] WARN 
> datanode.DataNode: IOException in offerService
> java.io.IOException: Read-only file system
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:947)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setRollingUpgradeMarkers(BlockPoolSliceStorage.java:721)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.setRollingUpgradeMarker(DataStorage.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.setRollingUpgradeMarker(FsDatasetImpl.java:2357)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.signalRollingUpgrade(BPOfferService.java:480)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.handleRollingUpgradeStatus(BPServiceActor.java:626)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:677)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:833)
> at java.lang.Thread.run(Thread.java:722)
> {noformat}
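
One illustrative direction, not the project's actual fix, would be to make the 
marker write best-effort per volume so that a single read-only volume does not 
block the key update. A minimal sketch with hypothetical names:

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;

// Hypothetical sketch only: write the rolling_upgrade marker per volume and
// tolerate individual volume failures instead of aborting the whole operation.
public class RollingUpgradeMarkers {
  public static void setMarkers(List<File> volumeDirs) {
    for (File volumeDir : volumeDirs) {
      try {
        File marker = new File(volumeDir, "rolling_upgrade");
        marker.createNewFile(); // returns false if the marker already exists
      } catch (IOException e) {
        // Log and continue: one failed (e.g. read-only) volume should not
        // prevent the DataNode from updating its block keys.
        System.err.println("Skipping failed volume " + volumeDir + ": " + e);
      }
    }
  }
}
{code}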



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2020-03-24 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HDFS-13243:
-
Target Version/s: 3.4.0  (was: 3.3.0)

As the code freeze for 3.3 has passed, moving this Jira to 3.4. Please feel free 
to revert if anyone has concerns. Thank you.

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, 
> HDFS-13243-v6.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced 
> by calling close and sync at the same time.
> When the close call is not successful, the UC block's status changes to 
> COMMITTED, and if a sync request is then popped from the queue and processed, 
> the sync operation changes the last block's length.
> After that, the DataNode reports all received blocks to the NameNode, which 
> checks the block length of all COMMITTED blocks. But the block length recorded 
> in NameNode memory already differs from the one reported by the DataNode, and 
> consequently the last block is marked as corrupted because of the inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLET

[jira] [Updated] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-24 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated HDFS-15191:
---
Status: Open  (was: Patch Available)

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.
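
As a small aside on the top frames of that stack: reading a variable-length int 
from an exhausted stream is exactly what produces a bare EOFException. A minimal, 
self-contained illustration (not the HDFS code path itself, just the 
WritableUtils behavior it goes through):

{code:java}
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.io.WritableUtils;

public class VIntEofDemo {
  public static void main(String[] args) throws IOException {
    // An empty buffer stands in for a legacy token buffer that is shorter
    // than readFieldsLegacy() expects.
    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(new byte[0]));
    WritableUtils.readVInt(in); // throws java.io.EOFException from readByte()
  }
}
{code}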



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-24 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated HDFS-15191:
---
Status: Patch Available  (was: Open)

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-24 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated HDFS-15191:
---
Attachment: HDFS-15191.003.patch

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15235) Transient network failure during NameNode failover makes cluster unavailable

2020-03-24 Thread YCozy (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YCozy updated HDFS-15235:
-
Description: 
We have an HA cluster with two NameNodes: an active NN1 and a standby NN2. At 
some point, NN1 becomes unhealthy and the admin tries to manually fail over to 
NN2 by running the command
{code:java}
$ hdfs haadmin -failover NN1 NN2
{code}
NN2 receives the request and becomes active:
{code:java}
2020-03-24 00:24:56,412 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started 
for standby state
2020-03-24 00:24:56,413 WARN 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer 
interrupted: sleep interrupted
2020-03-24 00:24:56,415 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required 
for active state
2020-03-24 00:24:56,417 INFO 
org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
unfinalized segments in /app/ha-name-dir-shared/current
2020-03-24 00:24:56,419 INFO 
org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering 
unfinalized segments in /app/nn2/name/current
2020-03-24 00:24:56,419 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest 
edits from old active before taking over writer role in edits logs
2020-03-24 00:24:56,435 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Reading 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@7c3095fa 
expecting start txid #1
2020-03-24 00:24:56,436 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Start loading edits file 
/app/ha-name-dir-shared/current/edits_001-019 
maxTxnsToRead = 9223372036854775807
2020-03-24 00:24:56,441 INFO 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
Fast-forwarding stream 
'/app/ha-name-dir-shared/current/edits_001-019' 
to transaction ID 1
2020-03-24 00:24:56,567 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Loaded 1 edits file(s) (the last named 
/app/ha-name-dir-shared/current/edits_001-019) 
of total size 1305.0, total edits 19.0, total load time 109.0 ms
2020-03-24 00:24:56,567 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all 
datanodes as stale
2020-03-24 00:24:56,568 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Processing 4 
messages from DataNodes that were previously queued during standby state
2020-03-24 00:24:56,569 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication 
and invalidation queues
2020-03-24 00:24:56,569 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: initializing 
replication queues
2020-03-24 00:24:56,570 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing 
edit logs at txnid 20
2020-03-24 00:24:56,571 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
Starting log segment at 20
2020-03-24 00:24:56,812 INFO 
org.apache.hadoop.hdfs.server.namenode.FSDirectory: Initializing quota with 4 
thread(s)
2020-03-24 00:24:56,819 INFO 
org.apache.hadoop.hdfs.server.namenode.FSDirectory: Quota initialization 
completed in 6 milliseconds
name space=3
storage space=24690
storage types=RAM_DISK=0, SSD=0, DISK=0, ARCHIVE=0, PROVIDED=0
2020-03-24 00:24:56,827 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Starting 
CacheReplicationMonitor with interval 3 milliseconds
{code}
But NN2 fails to send back the RPC response because of temporary network 
partitioning.
{code:java}
java.io.EOFException: End of File Exception between local host is: 
"24e7b5a52e85/172.17.0.2"; destination host is: "127.0.0.3":8180; : 
java.io.EOFException; For more details see:  
http://wiki.apache.org/hadoop/EOFException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:837)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:791)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1597)
        at org.apache.hadoop.ipc.Client.call(Client.java:1539)
        at org.apache.hadoop.ipc.Client.call(Client.java:1436)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy8.transitionToActive(Unknown Source)
        at 
org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToActive(HAServiceProtocolClien

[jira] [Updated] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-24 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated HDFS-15191:
---
Status: Patch Available  (was: Open)

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-24 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated HDFS-15191:
---
Status: Open  (was: Patch Available)

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.

2020-03-24 Thread Aiphago (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065536#comment-17065536
 ] 

Aiphago commented on HDFS-15180:


ping [~sodonnell], [~linyiqun], [~weichiu]. Any advice? Thanks.
 

>  DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
> ---
>
> Key: HDFS-15180
> URL: https://issues.apache.org/jira/browse/HDFS-15180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, 
> HDFS-15180.003.patch, HDFS-15180.004.patch, 
> image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, 
> image-2020-03-10-17-34-26-368.png
>
>
> Now the FsDatasetImpl datasetLock is heavy when there are many namespaces in a 
> big cluster. It would help if we could split the FsDatasetImpl datasetLock by 
> blockpool.
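
A minimal sketch of what per-block-pool locking could look like; the class and 
method names are hypothetical and this is not the patch attached to this issue:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Hypothetical sketch: one read/write lock per block pool instead of a single
// heavyweight dataset lock shared by all namespaces.
public class BlockPoolLockManager {
  private final Map<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReentrantReadWriteLock lockFor(String bpid) {
    return locks.computeIfAbsent(bpid, id -> new ReentrantReadWriteLock());
  }

  public <T> T withReadLock(String bpid, Supplier<T> action) {
    ReentrantReadWriteLock lock = lockFor(bpid);
    lock.readLock().lock();
    try {
      return action.get();
    } finally {
      lock.readLock().unlock();
    }
  }

  public void withWriteLock(String bpid, Runnable action) {
    ReentrantReadWriteLock lock = lockFor(bpid);
    lock.writeLock().lock();
    try {
      action.run();
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}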



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-24 Thread xuzq (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuzq updated HDFS-15238:

Attachment: HDFS-15238-trunk-001.patch
Status: Patch Available  (was: Open)

> RBF:NamenodeHeartbeatService caused memory to grow rapidly
> --
>
> Key: HDFS-15238
> URL: https://issues.apache.org/jira/browse/HDFS-15238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-15238-trunk-001.patch
>
>
> NamenodeHeartbeatService gets the NameNode's HA status every 5s and creates a 
> new HAServiceProtocol each time.
> Creating the HAServiceProtocol also creates a new Configuration.
> Over time, more and more REGISTER entries accumulate in the Configuration 
> until a full GC happens.
> The entries then pile up again, and after reaching a certain threshold the 
> full GC is triggered again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt

2020-03-24 Thread zhengchenyu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065509#comment-17065509
 ] 

zhengchenyu commented on HDFS-15237:


[~ayushtkn] Hadoop-3.2.1

> Get checksum of EC file failed, when some block is missing or corrupt
> -
>
> Key: HDFS-15237
> URL: https://issues.apache.org/jira/browse/HDFS-15237
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Priority: Major
> Fix For: 3.2.2
>
>
> When we distcp from an EC directory to another one, I found errors like 
> this.
> {code}
> 2020-03-20 20:18:21,366 WARN [main] 
> org.apache.hadoop.hdfs.FileChecksumHelper: src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]2020-03-20
>  20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper: 
> src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]java.io.EOFException:
>  Unexpected EOF while trying to read response from server at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:550)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.tryDatanode(FileChecksumHelper.java:709)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlockGroup(FileChecksumHelper.java:664)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlocks(FileChecksumHelper.java:638)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$FileChecksumComputer.compute(FileChecksumHelper.java:252)
>  at 
> org.apache.hadoop.hdfs.DFSClient.getFileChecksumInternal(DFSClient.java:1790) 
> at 
> org.apache.hadoop.hdfs.DFSClient.getFileChecksumWithCombineMode(DFSClient.java:1810)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1691)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1688)
>  at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1700)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:138)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:115)
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:259)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:220) at 
> org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48) at 
> org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) at 
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799) at 
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:347) at 
> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {code}
> And then I found errors in the DataNode like this
> {code}
> 2020-03-20 20:54:16,573 INFO 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: 
> SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = 
> false
> 2020-03-20 20:54:16,577 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> bd-hadoop-128050.zeus.lianjia.com:9866:DataXceiver error processing 
> BLOCK_GROUP_CHECKSUM operation src: /10.201.1.38:33264 dst: 
> /10.200.128.50:9866
> java.lang.UnsupportedOperationException
>  at java.nio.ByteBuffer.array(ByteBuffer.java:994)
>  at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockChecksumReconstructor.reconstruct(StripedBlockChecksumReconstructor.java:90)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.recalculateChecksum(BlockChecksumHelper.java:711)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.compute(BlockChecksumHelper.java:489)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.blockGroupChecksum(DataXceiver.java:1047)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opStripedBlockChecksum(Receiver.java:327)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:119)
>  at 
> org.apache.hadoop.hdfs.server.da

[jira] [Commented] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt

2020-03-24 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065508#comment-17065508
 ] 

Ayush Saxena commented on HDFS-15237:
-

What is the Hadoop version?

> Get checksum of EC file failed, when some block is missing or corrupt
> -
>
> Key: HDFS-15237
> URL: https://issues.apache.org/jira/browse/HDFS-15237
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Priority: Major
> Fix For: 3.2.2
>
>
> When we distcp from an EC directory to another one, I found errors like 
> this.
> {code}
> 2020-03-20 20:18:21,366 WARN [main] 
> org.apache.hadoop.hdfs.FileChecksumHelper: src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]2020-03-20
>  20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper: 
> src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]java.io.EOFException:
>  Unexpected EOF while trying to read response from server at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:550)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.tryDatanode(FileChecksumHelper.java:709)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlockGroup(FileChecksumHelper.java:664)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlocks(FileChecksumHelper.java:638)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$FileChecksumComputer.compute(FileChecksumHelper.java:252)
>  at 
> org.apache.hadoop.hdfs.DFSClient.getFileChecksumInternal(DFSClient.java:1790) 
> at 
> org.apache.hadoop.hdfs.DFSClient.getFileChecksumWithCombineMode(DFSClient.java:1810)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1691)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1688)
>  at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1700)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:138)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:115)
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:259)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:220) at 
> org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48) at 
> org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) at 
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799) at 
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:347) at 
> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {code}
> And then I found errors in the DataNode like this
> {code}
> 2020-03-20 20:54:16,573 INFO 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: 
> SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = 
> false
> 2020-03-20 20:54:16,577 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> bd-hadoop-128050.zeus.lianjia.com:9866:DataXceiver error processing 
> BLOCK_GROUP_CHECKSUM operation src: /10.201.1.38:33264 dst: 
> /10.200.128.50:9866
> java.lang.UnsupportedOperationException
>  at java.nio.ByteBuffer.array(ByteBuffer.java:994)
>  at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockChecksumReconstructor.reconstruct(StripedBlockChecksumReconstructor.java:90)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.recalculateChecksum(BlockChecksumHelper.java:711)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.compute(BlockChecksumHelper.java:489)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.blockGroupChecksum(DataXceiver.java:1047)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opStripedBlockChecksum(Receiver.java:327)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:119)
>  at 
> org.apache.hadoop.hdfs.serv

[jira] [Commented] (HDFS-15201) SnapshotCounter hits MaxSnapshotID limit

2020-03-24 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065484#comment-17065484
 ] 

Hudson commented on HDFS-15201:
---

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #18078 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18078/])
HDFS-15201 SnapshotCounter hits MaxSnapshotID limit (#1870) (github: rev 
5250cd6db3a6b7abeb5c9a0a4059f1d95986c07b)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/SnapshotManager.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestSnapshotManager.java


> SnapshotCounter hits MaxSnapshotID limit
> 
>
> Key: HDFS-15201
> URL: https://issues.apache.org/jira/browse/HDFS-15201
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Karthik Palanisamy
>Assignee: Karthik Palanisamy
>Priority: Major
>
> Users reported that they are unable to take HDFS snapshots and that their 
> snapshotCounter hits the MaxSnapshotID limit. The MaxSnapshotID limit is 
> 16777215.
> {code:java}
> SnapshotManager.java
> private static final int SNAPSHOT_ID_BIT_WIDTH = 24;
> /**
>  * Returns the maximum allowable snapshot ID based on the bit width of the
>  * snapshot ID.
>  *
>  * @return maximum allowable snapshot ID.
>  */
>  public int getMaxSnapshotID() {
>  return ((1 << SNAPSHOT_ID_BIT_WIDTH) - 1);
> }
> {code}
>  
> I think SNAPSHOT_ID_BIT_WIDTH is too low. It may be a good idea to increase 
> SNAPSHOT_ID_BIT_WIDTH to 31, to align with our CURRENT_STATE_ID limit 
> (Integer.MAX_VALUE - 1); a short arithmetic check follows after the snippet below.
>  
> {code:java}
> /**
>  * This id is used to indicate the current state (vs. snapshots)
>  */
> public static final int CURRENT_STATE_ID = Integer.MAX_VALUE - 1;
> {code}
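
A small arithmetic check of what each bit width allows (editor's illustration in 
plain Java, not the committed patch):

{code:java}
public class SnapshotIdWidth {
  public static void main(String[] args) {
    // Current 24-bit width: the cap is 2^24 - 1.
    int maxAt24 = (1 << 24) - 1;                  // 16777215
    // A 31-bit width would raise the cap to 2^31 - 1 = Integer.MAX_VALUE,
    // one above the CURRENT_STATE_ID sentinel (Integer.MAX_VALUE - 1).
    int maxAt31 = (1 << 31) - 1;                  // 2147483647
    System.out.println(maxAt24);
    System.out.println(maxAt31);
    System.out.println(Integer.MAX_VALUE - 1);    // CURRENT_STATE_ID
  }
}
{code}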



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15201) SnapshotCounter hits MaxSnapshotID limit

2020-03-24 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain resolved HDFS-15201.

Resolution: Fixed

> SnapshotCounter hits MaxSnapshotID limit
> 
>
> Key: HDFS-15201
> URL: https://issues.apache.org/jira/browse/HDFS-15201
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Karthik Palanisamy
>Assignee: Karthik Palanisamy
>Priority: Major
>
> Users reported that they are unable to take HDFS snapshots because their 
> snapshotCounter hits the MaxSnapshotID limit, which is 16777215.
> {code:java}
> SnapshotManager.java
> private static final int SNAPSHOT_ID_BIT_WIDTH = 24;
> /**
>  * Returns the maximum allowable snapshot ID based on the bit width of the
>  * snapshot ID.
>  *
>  * @return maximum allowable snapshot ID.
>  */
>  public int getMaxSnapshotID() {
>  return ((1 << SNAPSHOT_ID_BIT_WIDTH) - 1);
> }
> {code}
>  
> I think SNAPSHOT_ID_BIT_WIDTH is too low. It may be a good idea to increase 
> SNAPSHOT_ID_BIT_WIDTH to 31, to align with our CURRENT_STATE_ID limit 
> (Integer.MAX_VALUE - 1).
>  
> {code:java}
> /**
>  * This id is used to indicate the current state (vs. snapshots)
>  */
> public static final int CURRENT_STATE_ID = Integer.MAX_VALUE - 1;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt

2020-03-24 Thread zhengchenyu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065461#comment-17065461
 ] 

zhengchenyu edited comment on HDFS-15237 at 3/24/20, 9:43 AM:
--

Let me note a strange phenomenon here. Why does this error happen? Because I 
found a wrong internal block distribution like this: 
{code:java}
0. BP-1936287042-10.200.128.33-1573194961291:blk_-9223372036797335472_3783205 
len=247453749 Live_repl=8 
[blk_-9223372036797335472:DatanodeInfoWithStorage[10.200.128.43:9866,DS-2ddde0b8-6a84-4d06-8a40-d4ae5691e81c,DISK],
 
 
blk_-9223372036797335471:DatanodeInfoWithStorage[10.200.128.41:9866,DS-a4fc5486-6c45-481e-84e7-9393eeaf1313,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.50:9866,DS-fc0632c6-8916-42d8-8219-57b022bb2786,DISK],
 
 
blk_-9223372036797335469:DatanodeInfoWithStorage[10.200.128.54:9866,DS-1b6cb52a-f55a-4ef8-beaf-a5d7b7fe93aa,DISK],
 
 
blk_-9223372036797335467:DatanodeInfoWithStorage[10.200.128.52:9866,DS-fc6e00dd-ca5a-4580-9403-aeb6906da81a,DISK],
 
 
blk_-9223372036797335466:DatanodeInfoWithStorage[10.200.128.53:9866,DS-2c926a3b-64c0-441b-abe2-188e79918abe,DISK],
 
 
blk_-9223372036797335465:DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK],
 
 
blk_-9223372036797335464:DatanodeInfoWithStorage[10.200.128.44:9866,DS-3725af76-fe86-4f97-9740-d77bfa339b3f,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.45:9866,DS-250fd4cf-705f-4cb5-bc3a-c7a105247e35,DISK]]

{code}
This is the result of hdfs fsck. You can see this block group has 9 internal 
blocks, but there is no blk_-9223372036797335468 and blk_-9223372036797335470 appears twice. 
Although the distribution is wrong, the block group still has 9 internal blocks, so the 
replication monitor can't repair this error. I think this may be another issue. 

This block is too old and its log is missing, so I don't know the reason and 
can't reproduce this error now. 


was (Author: zhengchenyu):
Let me note a strange phenomenon here. Why does this error happen? Because I 
found a wrong internal block distribution like this: 
{code:java}
0. BP-1936287042-10.200.128.33-1573194961291:blk_-9223372036797335472_3783205 
len=247453749 Live_repl=8 
[blk_-9223372036797335472:DatanodeInfoWithStorage[10.200.128.43:9866,DS-2ddde0b8-6a84-4d06-8a40-d4ae5691e81c,DISK],
 
 
blk_-9223372036797335471:DatanodeInfoWithStorage[10.200.128.41:9866,DS-a4fc5486-6c45-481e-84e7-9393eeaf1313,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.50:9866,DS-fc0632c6-8916-42d8-8219-57b022bb2786,DISK],
 
 
blk_-9223372036797335469:DatanodeInfoWithStorage[10.200.128.54:9866,DS-1b6cb52a-f55a-4ef8-beaf-a5d7b7fe93aa,DISK],
 
 
blk_-9223372036797335467:DatanodeInfoWithStorage[10.200.128.52:9866,DS-fc6e00dd-ca5a-4580-9403-aeb6906da81a,DISK],
 
 
blk_-9223372036797335466:DatanodeInfoWithStorage[10.200.128.53:9866,DS-2c926a3b-64c0-441b-abe2-188e79918abe,DISK],
 
 
blk_-9223372036797335465:DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK],
 
 
blk_-9223372036797335464:DatanodeInfoWithStorage[10.200.128.44:9866,DS-3725af76-fe86-4f97-9740-d77bfa339b3f,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.45:9866,DS-250fd4cf-705f-4cb5-bc3a-c7a105247e35,DISK]]

{code}
This is the result of hdfs fsck. You can see this block group has 9 internal 
blocks, but there is no blk_-9223372036797335468 and blk_-9223372036797335470 appears twice. 

This block is too old and its log is missing, so I don't know the reason and 
can't reproduce this error now. 

> Get checksum of EC file failed, when some block is missing or corrupt
> -
>
> Key: HDFS-15237
> URL: https://issues.apache.org/jira/browse/HDFS-15237
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Priority: Major
> Fix For: 3.2.2
>
>
> When we distcp from an EC directory to another one, I found errors like 
> this.
> {code}
> 2020-03-20 20:18:21,366 WARN [main] 
> org.apache.hadoop.hdfs.FileChecksumHelper: src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]2020-03-20
>  20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper: 
> src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]java.io.EOFException:
>  Unexpected EOF while trying to read response from server at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:550)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.tryDatanode(FileChecksumHelper.java:709)

[jira] [Comment Edited] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt

2020-03-24 Thread zhengchenyu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065461#comment-17065461
 ] 

zhengchenyu edited comment on HDFS-15237 at 3/24/20, 9:42 AM:
--

Let me note a strange phenomenon here. Why does this error happen? Because I 
found a wrong internal block distribution like this: 
{code:java}
0. BP-1936287042-10.200.128.33-1573194961291:blk_-9223372036797335472_3783205 
len=247453749 Live_repl=8 
[blk_-9223372036797335472:DatanodeInfoWithStorage[10.200.128.43:9866,DS-2ddde0b8-6a84-4d06-8a40-d4ae5691e81c,DISK],
 
 
blk_-9223372036797335471:DatanodeInfoWithStorage[10.200.128.41:9866,DS-a4fc5486-6c45-481e-84e7-9393eeaf1313,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.50:9866,DS-fc0632c6-8916-42d8-8219-57b022bb2786,DISK],
 
 
blk_-9223372036797335469:DatanodeInfoWithStorage[10.200.128.54:9866,DS-1b6cb52a-f55a-4ef8-beaf-a5d7b7fe93aa,DISK],
 
 
blk_-9223372036797335467:DatanodeInfoWithStorage[10.200.128.52:9866,DS-fc6e00dd-ca5a-4580-9403-aeb6906da81a,DISK],
 
 
blk_-9223372036797335466:DatanodeInfoWithStorage[10.200.128.53:9866,DS-2c926a3b-64c0-441b-abe2-188e79918abe,DISK],
 
 
blk_-9223372036797335465:DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK],
 
 
blk_-9223372036797335464:DatanodeInfoWithStorage[10.200.128.44:9866,DS-3725af76-fe86-4f97-9740-d77bfa339b3f,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.45:9866,DS-250fd4cf-705f-4cb5-bc3a-c7a105247e35,DISK]]

{code}
This is the result of hdfs fsck. You can see this block group has 9 internal 
blocks, but there is no blk_-9223372036797335468 and blk_-9223372036797335470 appears twice. 

This block is too old and its log is missing, so I don't know the reason and 
can't reproduce this error now. 


was (Author: zhengchenyu):
Let me note a strange phenomenon here. Why does this error happen? Because I 
found a wrong internal block distribution like this: 

{code}

0. BP-1936287042-10.200.128.33-1573194961291:blk_-9223372036797335472_3783205 
len=247453749 Live_repl=8 
[blk_-9223372036797335472:DatanodeInfoWithStorage[10.200.128.43:9866,DS-2ddde0b8-6a84-4d06-8a40-d4ae5691e81c,DISK],
 
 
blk_-9223372036797335471:DatanodeInfoWithStorage[10.200.128.41:9866,DS-a4fc5486-6c45-481e-84e7-9393eeaf1313,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.50:9866,DS-fc0632c6-8916-42d8-8219-57b022bb2786,DISK],
 
 
blk_-9223372036797335469:DatanodeInfoWithStorage[10.200.128.54:9866,DS-1b6cb52a-f55a-4ef8-beaf-a5d7b7fe93aa,DISK],
 
 
blk_-9223372036797335467:DatanodeInfoWithStorage[10.200.128.52:9866,DS-fc6e00dd-ca5a-4580-9403-aeb6906da81a,DISK],
 
 
blk_-9223372036797335466:DatanodeInfoWithStorage[10.200.128.53:9866,DS-2c926a3b-64c0-441b-abe2-188e79918abe,DISK],
 
 
blk_-9223372036797335465:DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK],
 
 
blk_-9223372036797335464:DatanodeInfoWithStorage[10.200.128.44:9866,DS-3725af76-fe86-4f97-9740-d77bfa339b3f,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.45:9866,DS-250fd4cf-705f-4cb5-bc3a-c7a105247e35,DISK]]

{code}

This is the result of hdfs fsck. You can see this block group has 9 internal 
blocks, but there is no blk_-9223372036797335468 and blk_-9223372036797335480 appears twice. 

This block is too old and its log is missing, so I don't know the reason and 
can't reproduce this error now. 

> Get checksum of EC file failed, when some block is missing or corrupt
> -
>
> Key: HDFS-15237
> URL: https://issues.apache.org/jira/browse/HDFS-15237
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Priority: Major
> Fix For: 3.2.2
>
>
> When we distcp from an EC directory to another one, I found errors like 
> this.
> {code}
> 2020-03-20 20:18:21,366 WARN [main] 
> org.apache.hadoop.hdfs.FileChecksumHelper: src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]2020-03-20
>  20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper: 
> src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]java.io.EOFException:
>  Unexpected EOF while trying to read response from server at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:550)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.tryDatanode(FileChecksumHelper.java:709)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlockGroup(FileChecksumHelper.java:664)
>  at 
> org.apache.hado

[jira] [Commented] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt

2020-03-24 Thread zhengchenyu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065461#comment-17065461
 ] 

zhengchenyu commented on HDFS-15237:


Let me note a strange phenomenon here. Why does this error happen? Because I 
found a wrong internal block distribution like this: 

{code}

0. BP-1936287042-10.200.128.33-1573194961291:blk_-9223372036797335472_3783205 
len=247453749 Live_repl=8 
[blk_-9223372036797335472:DatanodeInfoWithStorage[10.200.128.43:9866,DS-2ddde0b8-6a84-4d06-8a40-d4ae5691e81c,DISK],
 
 
blk_-9223372036797335471:DatanodeInfoWithStorage[10.200.128.41:9866,DS-a4fc5486-6c45-481e-84e7-9393eeaf1313,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.50:9866,DS-fc0632c6-8916-42d8-8219-57b022bb2786,DISK],
 
 
blk_-9223372036797335469:DatanodeInfoWithStorage[10.200.128.54:9866,DS-1b6cb52a-f55a-4ef8-beaf-a5d7b7fe93aa,DISK],
 
 
blk_-9223372036797335467:DatanodeInfoWithStorage[10.200.128.52:9866,DS-fc6e00dd-ca5a-4580-9403-aeb6906da81a,DISK],
 
 
blk_-9223372036797335466:DatanodeInfoWithStorage[10.200.128.53:9866,DS-2c926a3b-64c0-441b-abe2-188e79918abe,DISK],
 
 
blk_-9223372036797335465:DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK],
 
 
blk_-9223372036797335464:DatanodeInfoWithStorage[10.200.128.44:9866,DS-3725af76-fe86-4f97-9740-d77bfa339b3f,DISK],
 
 
blk_-9223372036797335470:DatanodeInfoWithStorage[10.200.128.45:9866,DS-250fd4cf-705f-4cb5-bc3a-c7a105247e35,DISK]]

{code}

This is the result of hdfs fsck. You can see this block group has 9 internal 
blocks, but there is no blk_-9223372036797335468 and blk_-9223372036797335480 appears twice. 

This block is too old and its log is missing, so I don't know the reason and 
can't reproduce this error now. 

> Get checksum of EC file failed, when some block is missing or corrupt
> -
>
> Key: HDFS-15237
> URL: https://issues.apache.org/jira/browse/HDFS-15237
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Priority: Major
> Fix For: 3.2.2
>
>
> When we distcp from an EC directory to another one, I found errors like 
> this.
> {code}
> 2020-03-20 20:18:21,366 WARN [main] 
> org.apache.hadoop.hdfs.FileChecksumHelper: src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]2020-03-20
>  20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper: 
> src=/EC/6-3//000325_0, 
> datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]java.io.EOFException:
>  Unexpected EOF while trying to read response from server at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:550)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.tryDatanode(FileChecksumHelper.java:709)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlockGroup(FileChecksumHelper.java:664)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlocks(FileChecksumHelper.java:638)
>  at 
> org.apache.hadoop.hdfs.FileChecksumHelper$FileChecksumComputer.compute(FileChecksumHelper.java:252)
>  at 
> org.apache.hadoop.hdfs.DFSClient.getFileChecksumInternal(DFSClient.java:1790) 
> at 
> org.apache.hadoop.hdfs.DFSClient.getFileChecksumWithCombineMode(DFSClient.java:1810)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1691)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1688)
>  at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1700)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:138)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:115)
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:259)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:220) at 
> org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48) at 
> org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) at 
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799) at 
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:347) at 
> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.S

[jira] [Created] (HDFS-15237) Get checksum of EC file failed, when some block is missing or corrupt

2020-03-24 Thread zhengchenyu (Jira)
zhengchenyu created HDFS-15237:
--

 Summary: Get checksum of EC file failed, when some block is 
missing or corrupt
 Key: HDFS-15237
 URL: https://issues.apache.org/jira/browse/HDFS-15237
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ec, hdfs
Affects Versions: 3.2.1
Reporter: zhengchenyu
 Fix For: 3.2.2


When we distcp from an EC directory to another one, I found errors like 
this.

{code}

2020-03-20 20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper: 
src=/EC/6-3//000325_0, 
datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]2020-03-20
 20:18:21,366 WARN [main] org.apache.hadoop.hdfs.FileChecksumHelper: 
src=/EC/6-3//000325_0, 
datanodes[6]=DatanodeInfoWithStorage[10.200.128.40:9866,DS-65ac4407-9d33-4c59-8f72-dd1d80d26d9f,DISK]java.io.EOFException:
 Unexpected EOF while trying to read response from server at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:550)
 at 
org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.tryDatanode(FileChecksumHelper.java:709)
 at 
org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlockGroup(FileChecksumHelper.java:664)
 at 
org.apache.hadoop.hdfs.FileChecksumHelper$StripedFileNonStripedChecksumComputer.checksumBlocks(FileChecksumHelper.java:638)
 at 
org.apache.hadoop.hdfs.FileChecksumHelper$FileChecksumComputer.compute(FileChecksumHelper.java:252)
 at 
org.apache.hadoop.hdfs.DFSClient.getFileChecksumInternal(DFSClient.java:1790) 
at 
org.apache.hadoop.hdfs.DFSClient.getFileChecksumWithCombineMode(DFSClient.java:1810)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1691)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1688)
 at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1700)
 at 
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:138)
 at 
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:115)
 at 
org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87) 
at 
org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:259)
 at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:220) at 
org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48) at 
org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) at 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:347) at 
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

{code}

And then I found errors in the datanode like this

{code}

2020-03-20 20:54:16,573 INFO 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient: SASL 
encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-03-20 20:54:16,577 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
bd-hadoop-128050.zeus.lianjia.com:9866:DataXceiver error processing 
BLOCK_GROUP_CHECKSUM operation src: /10.201.1.38:33264 dst: /10.200.128.50:9866
java.lang.UnsupportedOperationException
 at java.nio.ByteBuffer.array(ByteBuffer.java:994)
 at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockChecksumReconstructor.reconstruct(StripedBlockChecksumReconstructor.java:90)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.recalculateChecksum(BlockChecksumHelper.java:711)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.compute(BlockChecksumHelper.java:489)
 at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.blockGroupChecksum(DataXceiver.java:1047)
 at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opStripedBlockChecksum(Receiver.java:327)
 at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:119)
 at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:292)
 at java.lang.Thread.run(Thread.java:748)

{code}

The reason is that when some block is missing or corrupt, the datanode 
triggers a call to recalculateChecksum. But if 
StripedBlockChecksumReconstructor.targetBuffer is a DirectByteBuffer, we can't 
call DirectByteBuffer.array(), so the exception above is thrown and we can't 
get the checksum.
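
As a minimal illustration (editor's sketch in plain Java, not the Hadoop fix), 
here is the failure mode and one safe way to read bytes from a buffer that may 
be direct:

{code:java}
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BufferBytes {
  // Return the readable bytes of a buffer without assuming it is heap-backed.
  // Calling array() on a direct buffer throws UnsupportedOperationException,
  // which is the exception seen in the datanode log above.
  static byte[] readableBytes(ByteBuffer buf) {
    if (buf.hasArray()) {
      int start = buf.arrayOffset() + buf.position();
      return Arrays.copyOfRange(buf.array(), start, start + buf.remaining());
    }
    byte[] copy = new byte[buf.remaining()];
    buf.duplicate().get(copy);   // duplicate() keeps the caller's position intact
    return copy;
  }

  public static void main(String[] args) {
    ByteBuffer direct = ByteBuffer.allocateDirect(8);
    direct.put(new byte[]{1, 2, 3, 4}).flip();
    System.out.println(Arrays.toString(readableBytes(direct)));  // [1, 2, 3, 4]
  }
}
{code}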



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-

[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-03-24 Thread Sean Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065398#comment-17065398
 ] 

Sean Chow commented on HDFS-14476:
--

Hi [~weichiu], I can't see why the build failed. Any chance to merge this to 
2.10.1?

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, 
> HDFS-14476-branch-2.02.patch, HDFS-14476.00.patch, HDFS-14476.002.patch, 
> HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner has the results of differences between disk and 
> in-memory blocks, it will try to run {{checkAndUpdate}} to fix them. However, 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call.
> As I have about 6 million blocks on every datanode and every 6-hour scan 
> finds about 25000 abnormal blocks to fix, that leads to a long lock being 
> held on the FsDatasetImpl object.
> Let's assume every block needs 10ms to fix (because of SAS disk latency); 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for over 4 minutes on that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> It takes a long time to process commands from the NN because threads are 
> blocked, and the namenode will see a long lastContact time for this datanode.
> This may affect all HDFS versions.
> *How to fix:*
> Just like the invalidate command from the namenode is processed with a batch 
> size of 1000, these abnormal blocks should be fixed in batches too, sleeping 
> 2 seconds between batches to allow normal block reads and writes (a sketch of 
> the idea follows below).
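
A minimal sketch of that batching idea (editor's illustration with generic 
types; the lock object, Runnable work items, and constants are stand-ins, not 
the actual FsDatasetImpl/DirectoryScanner code):

{code:java}
import java.util.List;

public class BatchedFixer {
  private static final int BATCH_SIZE = 1000;   // mirrors the invalidate batch size
  private static final long PAUSE_MS = 2000;    // pause between batches

  private final Object datasetLock = new Object();  // stand-in for the dataset lock

  // Fix the abnormal blocks in batches, releasing the lock and sleeping between
  // batches so normal reads/writes are not blocked for the whole scan result.
  void runInBatches(List<Runnable> fixes) throws InterruptedException {
    for (int start = 0; start < fixes.size(); start += BATCH_SIZE) {
      int end = Math.min(start + BATCH_SIZE, fixes.size());
      synchronized (datasetLock) {               // lock held only for one batch
        for (Runnable fix : fixes.subList(start, end)) {
          fix.run();                             // per-block reconciliation step
        }
      }
      if (end < fixes.size()) {
        Thread.sleep(PAUSE_MS);
      }
    }
  }
}
{code}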



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14429) Block remain in COMMITTED but not COMPLETE caused by Decommission

2020-03-24 Thread dark_num (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065375#comment-17065375
 ] 

dark_num edited comment on HDFS-14429 at 3/24/20, 7:31 AM:
---

{code:java}
// Why not use this method to judge the replica state
// It also includes the "MAINTENANCE" cases, avoiding a wrong calculation,
// because the current live replica count only includes the "normal/live" state
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 1 : 0;
  //...
}

public enum AdminStates {
  NORMAL("In Service"),
  DECOMMISSION_INPROGRESS("Decommission In Progress"),
  DECOMMISSIONED("Decommissioned"),
  ENTERING_MAINTENANCE("Entering Maintenance"),
  IN_MAINTENANCE("In Maintenance");
  //...
}{code}
[~caiyicong] Thank you for your efforts; I look forward to your reply.
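
A self-contained version of that idea (editor's sketch with stand-in classes, 
not the Hadoop DatanodeInfo/AdminStates types):

{code:java}
public class ReplicaDeltaSketch {
  // Stand-in for the admin states quoted above.
  enum AdminStates { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED,
                     ENTERING_MAINTENANCE, IN_MAINTENANCE }

  static class Node {
    AdminStates adminState;
    Node(AdminStates state) { this.adminState = state; }
    // Only a NORMAL ("In Service") node counts as in service.
    boolean isInService() { return adminState == AdminStates.NORMAL; }
  }

  // A replica only adds to the live-replica delta when its node is in service,
  // so decommissioning and maintenance nodes do not inflate the live count.
  static int replicaDelta(Node node, boolean added) {
    return (added && node.isInService()) ? 1 : 0;
  }

  public static void main(String[] args) {
    System.out.println(replicaDelta(new Node(AdminStates.NORMAL), true));                  // 1
    System.out.println(replicaDelta(new Node(AdminStates.DECOMMISSION_INPROGRESS), true)); // 0
  }
}
{code}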

 


was (Author: dark_num):
{code:java}
// Why not use this method to judge the replica state
// It also include the "MAINTENANCE" case, avoid wrong calculation
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 1 : 0;
  //...
}

public enum AdminStates {
  NORMAL("In Service"),
  DECOMMISSION_INPROGRESS("Decommission In Progress"),
  DECOMMISSIONED("Decommissioned"),
  ENTERING_MAINTENANCE("Entering Maintenance"),
  IN_MAINTENANCE("In Maintenance");
  //...
}{code}
[~caiyicong] Thank you for your efforts and look forward to your reply,  

 

> Block remain in COMMITTED but not COMPLETE caused by Decommission
> -
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, 
> HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3).
>  # bk1 has been completely written to the three data nodes; each data node 
> finishes FinalizeBlock, adds it to its IBR, and waits to report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DNs have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DN reports the IBR, but the block cannot be completed normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> And it will cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed like HDFS-11499.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (HDFS-14429) Block remain in COMMITTED but not COMPLETE caused by Decommission

2020-03-24 Thread dark_num (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065375#comment-17065375
 ] 

dark_num edited comment on HDFS-14429 at 3/24/20, 7:30 AM:
---

{code:java}
// Why not use this method to judge the replica state
// It also include the "MAINTENANCE" case, avoid wrong calculation
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 1 : 0;
  //...
}

public enum AdminStates {
  NORMAL("In Service"),
  DECOMMISSION_INPROGRESS("Decommission In Progress"),
  DECOMMISSIONED("Decommissioned"),
  ENTERING_MAINTENANCE("Entering Maintenance"),
  IN_MAINTENANCE("In Maintenance");
  //...
}{code}
[~caiyicong] Thank you for your efforts and look forward to your reply,  

 


was (Author: dark_num):
{code:java}
// Why not use this method to judge the replica state
// It also include the "MAINTENANCE" case, avoid wrong calculation
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 0 : 1;
  //...
}

public enum AdminStates {
  NORMAL("In Service"),
  DECOMMISSION_INPROGRESS("Decommission In Progress"),
  DECOMMISSIONED("Decommissioned"),
  ENTERING_MAINTENANCE("Entering Maintenance"),
  IN_MAINTENANCE("In Maintenance");
  //...
}{code}
[~caiyicong] Thank you for your efforts and look forward to your reply,  

 

> Block remain in COMMITTED but not COMPLETE caused by Decommission
> -
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, 
> HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3).
>  # bk1 has been completely written to the three data nodes; each data node 
> finishes FinalizeBlock, adds it to its IBR, and waits to report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DNs have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DN reports the IBR, but the block cannot be completed normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> And it will cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed like HDFS-11499.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14429) Block remain in COMMITTED but not COMPLETE caused by Decommission

2020-03-24 Thread dark_num (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065375#comment-17065375
 ] 

dark_num edited comment on HDFS-14429 at 3/24/20, 7:28 AM:
---

{code:java}
// Why not use this method to judge the replica state
// It also include the "MAINTENANCE" case, avoid wrong calculation
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 0 : 1;
  //...
}

public enum AdminStates {
  NORMAL("In Service"),
  DECOMMISSION_INPROGRESS("Decommission In Progress"),
  DECOMMISSIONED("Decommissioned"),
  ENTERING_MAINTENANCE("Entering Maintenance"),
  IN_MAINTENANCE("In Maintenance");
  //...
}{code}
[~caiyicong] Thank you for your efforts and look forward to your reply,  

 


was (Author: dark_num):
{code:java}
// Why not use this method to judge the replica state
// It also include the "MAINTENANCE" case, avoid wrong calculation
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 0 : 1;
  //...
}
{code}
 Thank you for your efforts and look forward to your reply,  

 

> Block remain in COMMITTED but not COMPLETE caused by Decommission
> -
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, 
> HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3).
>  # bk1 has been completely written to the three data nodes; each data node 
> finishes FinalizeBlock, adds it to its IBR, and waits to report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DNs have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DN reports the IBR, but the block cannot be completed normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> And it will cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed like HDFS-11499.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14429) Block remain in COMMITTED but not COMPLETE caused by Decommission

2020-03-24 Thread dark_num (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065375#comment-17065375
 ] 

dark_num edited comment on HDFS-14429 at 3/24/20, 7:26 AM:
---

{code:java}
// Why not use this method to judge the replica state
// It also include the "MAINTENANCE" case, avoid wrong calculation
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 0 : 1;
  //...
}
{code}
 Thank you for your efforts and look forward to your reply,  

 


was (Author: dark_num):
{code:java}
// Why not use this method to judge the replica state
// It also include the "MAINTENANCE" case, avoid wrong calculation
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 0 : 1;
  //...
}


{code}
 Thank you for your efforts and look forward to your reply,  

 

> Block remain in COMMITTED but not COMPLETE caused by Decommission
> -
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, 
> HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3).
>  # bk1 has been completely written to the three data nodes; each data node 
> finishes FinalizeBlock, adds it to its IBR, and waits to report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DNs have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DN reports the IBR, but the block cannot be completed normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> And it will cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed like HDFS-11499.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14429) Block remain in COMMITTED but not COMPLETE caused by Decommission

2020-03-24 Thread dark_num (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065375#comment-17065375
 ] 

dark_num edited comment on HDFS-14429 at 3/24/20, 7:25 AM:
---

{code:java}
// Why not use this method to judge the replica state
// It also include the "MAINTENANCE" case, avoid wrong calculation
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 0 : 1;
  //...
}


{code}
 Thank you for your efforts and look forward to your reply,  

 


was (Author: dark_num):
{code:java}
// Why not use this method to judge the replica state
// It also include the "MAINTENANCE" case, avoid wrong calculation
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 0 : 1;
  //...
}

Thank you for your efforts and look forward to your reply

{code}
 
 
 

 

> Block remain in COMMITTED but not COMPLETE caused by Decommission
> -
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, 
> HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3).
>  # bk1 has been completely written to the three data nodes; each data node 
> finishes FinalizeBlock, adds it to its IBR, and waits to report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DNs have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DN reports the IBR, but the block cannot be completed normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> And it will cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed like HDFS-11499.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14429) Block remain in COMMITTED but not COMPLETE caused by Decommission

2020-03-24 Thread dark_num (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065375#comment-17065375
 ] 

dark_num commented on HDFS-14429:
-

{code:java}
// Why not use this method to judge the replica state
// It also include the "MAINTENANCE" case, avoid wrong calculation
if (result == AddBlockResult.ADDED) {
  curReplicaDelta = (node.isInService()) ? 0 : 1;
  //...
}

Thank you for your efforts and look forward to your reply

{code}
 
 
 

 

> Block remain in COMMITTED but not COMPLETE caused by Decommission
> -
>
> Key: HDFS-14429
> URL: https://issues.apache.org/jira/browse/HDFS-14429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, 
> HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, 
> HDFS-14429.branch-2.02.patch
>
>
> In the following scenario, the Block will remain in the COMMITTED but not 
> COMPLETE state and cannot be closed properly:
>  # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3).
>  # bk1 has been completely written to the three data nodes; each data node 
> finishes FinalizeBlock, adds it to its IBR, and waits to report to the NameNode.
>  # The client commits bk1 after receiving the ACK.
>  # Before the DNs have reported the IBR, all three nodes dn1/dn2/dn3 
> enter Decommissioning.
>  # The DN reports the IBR, but the block cannot be completed normally.
>  
> Then it will lead to the following related exceptions:
> {panel:title=Exception}
> 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* 
> blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= 
> minimum = 1) in file xxx
> 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - 
> IPC Server handler 499 on 8020, call Call#122552 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
> replicated yet: xxx
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
> {panel}
> And it will cause the scenario described in HDFS-12747.
> The root cause is that addStoredBlock does not consider the case where the 
> replicas are in Decommission.
> This problem needs to be fixed like HDFS-11499.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org