[jira] [Updated] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?

2024-07-16 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-17535:

Description: 
I have learned that EC does indeed have a serious file-corruption bug:
https://issues.apache.org/jira/browse/HDFS-15759

1: I have confirmed the EC corrupt file; can this corrupt file be restored?
Important data is affected, which is causing production data loss for us. Is there a way to recover it?
Checking EC block group: blk_-9223372036361352768
Status: ERROR, message: EC compute result not match. IP is 10.12.66.116, block is -9223372036361352765

2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

HDFS version 3.1.0

Thank you

 

Latest findings: it was a machine network problem. The CPU soft-interrupt (si) load was too high, so the NameNode lost DataNode heartbeats and sent recovery and reconstruction commands to the DataNodes.

Because the Weaver-Scope service of k8s is installed on the server, conntrack connection tracking times out severely, affecting all network usage.

  was:
I have learned that EC does indeed have a serious file-corruption bug:
https://issues.apache.org/jira/browse/HDFS-15759

1: I have confirmed the EC corrupt file; can this corrupt file be restored?
Important data is affected, which is causing production data loss for us. Is there a way to recover it?
Checking EC block group: blk_-9223372036361352768
Status: ERROR, message: EC compute result not match. IP is 10.12.66.116, block is -9223372036361352765

2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

HDFS version 3.1.0

Thank you

 

Latest findings: it was a machine network problem. The CPU soft-interrupt (si) load was too high, so the NameNode lost DataNode heartbeats and sent recovery and reconstruction commands to the DataNodes.

Because the Weaver-Scope service of k8s is installed on the server, conntrack connection tracking times out severely, affecting all network usage.


> I have confirmed the EC corrupt file, can this corrupt file be restored?
> 
>
> Key: HDFS-17535
> URL: https://issues.apache.org/jira/browse/HDFS-17535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
> I have learned that EC does indeed have a serious file-corruption bug:
> https://issues.apache.org/jira/browse/HDFS-15759
> 1: I have confirmed the EC corrupt file; can this corrupt file be restored?
> Important data is affected, which is causing production data loss for us. Is there a way to recover it?
> Checking EC block group: blk_-9223372036361352768
> Status: ERROR, message: EC compute result not match. IP is 10.12.66.116, block is -9223372036361352765
> 2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
> HDFS version 3.1.0
> Thank you
>  
> Latest findings: it was a machine network problem. The CPU soft-interrupt (si) load was too high, so the NameNode lost DataNode heartbeats and sent recovery and reconstruction commands to the DataNodes.
> Because the Weaver-Scope service of k8s is installed on the server, conntrack connection tracking times out severely, affecting all network usage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?

2024-07-16 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-17535:

Description: 
I have learned that EC does indeed have a serious file-corruption bug:
https://issues.apache.org/jira/browse/HDFS-15759

1: I have confirmed the EC corrupt file; can this corrupt file be restored?
Important data is affected, which is causing production data loss for us. Is there a way to recover it?
Checking EC block group: blk_-9223372036361352768
Status: ERROR, message: EC compute result not match. IP is 10.12.66.116, block is -9223372036361352765

2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

HDFS version 3.1.0

Thank you

 

Latest findings: it was a machine network problem. The CPU soft-interrupt (si) load was too high, so the NameNode lost DataNode heartbeats and sent recovery and reconstruction commands to the DataNodes.

Because the Weaver-Scope service of k8s is installed on the server, conntrack connection tracking times out severely, affecting all network usage.

  was:
I learned that EC does have a major file-corruption bug:
https://issues.apache.org/jira/browse/HDFS-15759

1: I have confirmed the EC corrupt file; can this corrupt file be restored?
Important data is affected, which is causing production data loss for us. Is there a way to recover it?
Checking EC block group: blk_-9223372036361352768
Status: ERROR, message: EC compute result not match. IP is 10.12.66.116, block is -9223372036361352765

2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

HDFS version 3.1.0

Thank you


> I have confirmed the EC corrupt file, can this corrupt file be restored?
> 
>
> Key: HDFS-17535
> URL: https://issues.apache.org/jira/browse/HDFS-17535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
> I have learned that EC does indeed have a serious file-corruption bug:
> https://issues.apache.org/jira/browse/HDFS-15759
> 1: I have confirmed the EC corrupt file; can this corrupt file be restored?
> Important data is affected, which is causing production data loss for us. Is there a way to recover it?
> Checking EC block group: blk_-9223372036361352768
> Status: ERROR, message: EC compute result not match. IP is 10.12.66.116, block is -9223372036361352765
> 2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
> HDFS version 3.1.0
> Thank you
>  
> Latest findings: it was a machine network problem. The CPU soft-interrupt (si) load was too high, so the NameNode lost DataNode heartbeats and sent recovery and reconstruction commands to the DataNodes.
> Because the Weaver-Scope service of k8s is installed on the server, conntrack connection tracking times out severely, affecting all network usage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?

2024-06-27 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854008#comment-17854008
 ] 

ruiliang edited comment on HDFS-17535 at 6/27/24 1:34 PM:
--

[https://github.com/liangrui1988/hadoop-client-op/blob/main/src/main/java/com/yy/bigdata/orc/OpenFileLine.java]

After studying this for a long time, I have implemented a bad-block recovery method. With two bad blocks (rs-3-1024), the reader needs to read the [orc, txt, txt gzip, parquet] file up to 10 times while excluding DataNodes, and then check whether the [orc, txt, txt gzip, parquet] file is valid. The recovery program is as follows; it includes changes to the source code, and the relevant jar is in the lib/ directory.

 
{code:java}
   1: check the EC file & return the DataNode IP info for the single bad block
   2: read the EC file, skipping the bad-block DataNode IP, and copy it to a new dir
   3: ORC check read (verify according to your own file format)
   4: if bad blocks > 1, read the data across all DataNode combinations
   5: if it still cannot be recovered from any DataNode, the data is unrecoverable{code}
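
To make step 3 concrete, here is a minimal sketch of an ORC validity check, assuming the file has already been copied out with the bad-block DataNode excluded. The class name and the check itself are illustrative, not code from the recovery program above:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcValidityCheck {
  // Returns true only if every stripe decodes and the row count matches the
  // footer; a corrupt stripe surfaces as an exception (for example the
  // "Problem opening stripe 0 footer" IOException seen on bad copies).
  public static boolean isReadable(String file) {
    try {
      Configuration conf = new Configuration();
      Reader reader = OrcFile.createReader(new Path(file), OrcFile.readerOptions(conf));
      VectorizedRowBatch batch = reader.getSchema().createRowBatch();
      RecordReader rows = reader.rows();
      long count = 0;
      while (rows.nextBatch(batch)) {   // forces every stripe to be read
        count += batch.size;
      }
      rows.close();
      return count == reader.getNumberOfRows();
    } catch (Exception e) {
      return false;
    }
  }
}{code}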
 

It would be best if the community provided this feature officially.


was (Author: ruilaing):
[https://github.com/liangrui1988/hadoop-client-op/blob/main/src/main/java/com/yy/bigdata/orc/OpenFileLine.java]

After studying this for a long time, I have implemented a bad-block recovery method. With two bad blocks (rs-3-1024), the reader needs to read the [orc, txt, txt gzip, parquet] file up to 10 times while excluding DataNodes, and then check whether the ORC file is valid. The recovery program is as follows; it includes changes to the source code, and the relevant jar is in the lib/ directory.

 
{code:java}
   1: check the EC file & return the DataNode IP info for the single bad block
   2: read the EC file, skipping the bad-block DataNode IP, and copy it to a new dir
   3: ORC check read (verify according to your own file format)
   4: if bad blocks > 1, read the data across all DataNode combinations
   5: if it still cannot be recovered from any DataNode, the data is unrecoverable{code}
 

It would be best if the community provided this feature officially.

> I have confirmed the EC corrupt file, can this corrupt file be restored?
> 
>
> Key: HDFS-17535
> URL: https://issues.apache.org/jira/browse/HDFS-17535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
> I learned that EC does have a major file-corruption bug:
> https://issues.apache.org/jira/browse/HDFS-15759
> 1: I have confirmed the EC corrupt file; can this corrupt file be restored?
> Important data is affected, which is causing production data loss for us. Is there a way to recover it?
> Checking EC block group: blk_-9223372036361352768
> Status: ERROR, message: EC compute result not match. IP is 10.12.66.116, block is -9223372036361352765
> 2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
> HDFS version 3.1.0
> Thank you



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?

2024-06-27 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-17535:

Description: 
I learned that EC does have a major file-corruption bug:
https://issues.apache.org/jira/browse/HDFS-15759

1: I have confirmed the EC corrupt file; can this corrupt file be restored?
Important data is affected, which is causing production data loss for us. Is there a way to recover it?
Checking EC block group: blk_-9223372036361352768
Status: ERROR, message: EC compute result not match. IP is 10.12.66.116, block is -9223372036361352765

2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

HDFS version 3.1.0

Thank you

  was:
I learned that EC does have a major file-corruption bug:
https://issues.apache.org/jira/browse/HDFS-15759


1: I have confirmed the EC corrupt file; can this corrupt file be restored?
Important data is affected, which is causing production data loss for us. Is there a way to recover it?
corrupt;/file;corrupt block groups \{blk_-xx} zeroParityBlockGroups \{blk_-xx[blk_-xx]}

2: https://github.com/apache/orc/issues/1939 I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?


HDFS version 3.1.0

Thank you


> I have confirmed the EC corrupt file, can this corrupt file be restored?
> 
>
> Key: HDFS-17535
> URL: https://issues.apache.org/jira/browse/HDFS-17535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
> I learned that EC does have a major file-corruption bug:
> https://issues.apache.org/jira/browse/HDFS-15759
> 1: I have confirmed the EC corrupt file; can this corrupt file be restored?
> Important data is affected, which is causing production data loss for us. Is there a way to recover it?
> Checking EC block group: blk_-9223372036361352768
> Status: ERROR, message: EC compute result not match. IP is 10.12.66.116, block is -9223372036361352765
> 2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
> HDFS version 3.1.0
> Thank you



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?

2024-06-27 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854008#comment-17854008
 ] 

ruiliang edited comment on HDFS-17535 at 6/27/24 8:28 AM:
--

[https://github.com/liangrui1988/hadoop-client-op/blob/main/src/main/java/com/yy/bigdata/orc/OpenFileLine.java]

After studying this for a long time, I have implemented a bad-block recovery method. With two bad blocks (rs-3-1024), the reader needs to read the [orc, txt, txt gzip, parquet] file up to 10 times while excluding DataNodes, and then check whether the ORC file is valid. The recovery program is as follows; it includes changes to the source code, and the relevant jar is in the lib/ directory.

 
{code:java}
   1: check the EC file & return the DataNode IP info for the single bad block
   2: read the EC file, skipping the bad-block DataNode IP, and copy it to a new dir
   3: ORC check read (verify according to your own file format)
   4: if bad blocks > 1, read the data across all DataNode combinations
   5: if it still cannot be recovered from any DataNode, the data is unrecoverable{code}
 

It would be best if the community provided this feature officially.


was (Author: ruilaing):
[https://github.com/liangrui1988/hadoop-client-op/blob/main/src/main/java/com/yy/bigdata/orc/OpenFileLine.java]

After studying this for a long time, I have implemented a bad-block recovery method. With two bad blocks (rs-3-1024), the reader needs to read the ORC file up to 10 times while excluding DataNodes, and then check whether the ORC file is valid. The recovery program is as follows; it includes changes to the source code, and the relevant jar is in the lib/ directory.

 
{code:java}
   1: check the EC file & return the DataNode IP info for the single bad block
   2: read the EC file, skipping the bad-block DataNode IP, and copy it to a new dir
   3: ORC check read (verify according to your own file format)
   4: if bad blocks > 1, read the data across all DataNode combinations
   5: if it still cannot be recovered from any DataNode, the data is unrecoverable{code}
 

It would be best if the community provided this feature officially.

> I have confirmed the EC corrupt file, can this corrupt file be restored?
> 
>
> Key: HDFS-17535
> URL: https://issues.apache.org/jira/browse/HDFS-17535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
> I learned that EC does have a major file-corruption bug:
> https://issues.apache.org/jira/browse/HDFS-15759
> 1: I have confirmed the EC corrupt file; can this corrupt file be restored?
> Important data is affected, which is causing production data loss for us. Is there a way to recover it?
> corrupt;/file;corrupt block groups \{blk_-xx} zeroParityBlockGroups \{blk_-xx[blk_-xx]}
> 2: https://github.com/apache/orc/issues/1939 I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
> HDFS version 3.1.0
> Thank you



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?

2024-06-11 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854008#comment-17854008
 ] 

ruiliang edited comment on HDFS-17535 at 6/11/24 11:51 AM:
---

[https://github.com/liangrui1988/hadoop-client-op/blob/main/src/main/java/com/yy/bigdata/orc/OpenFileLine.java]

After studying this for a long time, I have implemented a bad-block recovery method. With two bad blocks (rs-3-1024), the reader needs to read the ORC file up to 10 times while excluding DataNodes, and then check whether the ORC file is valid. The recovery program is as follows; it includes changes to the source code, and the relevant jar is in the lib/ directory.

 
{code:java}
   1: check the EC file & return the DataNode IP info for the single bad block
   2: read the EC file, skipping the bad-block DataNode IP, and copy it to a new dir
   3: ORC check read (verify according to your own file format)
   4: if bad blocks > 1, read the data across all DataNode combinations
   5: if it still cannot be recovered from any DataNode, the data is unrecoverable{code}
 

It would be best if the community provided this feature officially.


was (Author: ruilaing):
[https://github.com/liangrui1988/hadoop-client-op/blob/main/src/main/java/com/yy/bigdata/orc/OpenFileLine.java]

After studying this for a long time, I have implemented a bad-block recovery method. With two bad blocks (rs-3-1024), the reader needs to read the ORC file up to 10 times while excluding DataNodes, and then check whether the ORC file is valid. The recovery program is in the linked file; it includes changes to the source code, and the relevant jar is in the lib/ directory.

It would be best if the community provided this feature officially.

> I have confirmed the EC corrupt file, can this corrupt file be restored?
> 
>
> Key: HDFS-17535
> URL: https://issues.apache.org/jira/browse/HDFS-17535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
> I learned that EC does have a major file-corruption bug:
> https://issues.apache.org/jira/browse/HDFS-15759
> 1: I have confirmed the EC corrupt file; can this corrupt file be restored?
> Important data is affected, which is causing production data loss for us. Is there a way to recover it?
> corrupt;/file;corrupt block groups \{blk_-xx} zeroParityBlockGroups \{blk_-xx[blk_-xx]}
> 2: https://github.com/apache/orc/issues/1939 I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
> HDFS version 3.1.0
> Thank you



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?

2024-06-11 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854008#comment-17854008
 ] 

ruiliang commented on HDFS-17535:
-

[https://github.com/liangrui1988/hadoop-client-op/blob/main/src/main/java/com/yy/bigdata/orc/OpenFileLine.java]

After studying this for a long time, I have implemented a bad-block recovery method. With two bad blocks (rs-3-1024), the reader needs to read the ORC file up to 10 times while excluding DataNodes, and then check whether the ORC file is valid. The recovery program is in the linked file; it includes changes to the source code, and the relevant jar is in the lib/ directory.

It would be best if the community provided this feature officially.

> I have confirmed the EC corrupt file, can this corrupt file be restored?
> 
>
> Key: HDFS-17535
> URL: https://issues.apache.org/jira/browse/HDFS-17535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
> I learned that EC does have a major file-corruption bug:
> https://issues.apache.org/jira/browse/HDFS-15759
> 1: I have confirmed the EC corrupt file; can this corrupt file be restored?
> Important data is affected, which is causing production data loss for us. Is there a way to recover it?
> corrupt;/file;corrupt block groups \{blk_-xx} zeroParityBlockGroups \{blk_-xx[blk_-xx]}
> 2: https://github.com/apache/orc/issues/1939 I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
> HDFS version 3.1.0
> Thank you



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17547) debug verifyEC check error

2024-06-07 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-17547:

 Attachment: image-2024-06-07-16-02-07-480.png
Description: 
When I validate a block that has been corrupted, why does it still appear normal after many checks?

 
{code:java}
hdfs  debug verifyEC  -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not 
available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
 

 

The ByteBuffer hb arrays show all zeros [0..] [0..]:

!image-2024-06-07-16-02-07-480.png!
{code:java}
buffers = {ByteBuffer[5]@3270} 
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]

buffers[this.dataBlkNum + ixx].equals(outputs[ixx]) == true ?

outputs = {ByteBuffer[2]@3271} 
 0 = {HeapByteBuffer@3455} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3459} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]{code}
Can this situation be judged as an anomaly?
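
For reference, a minimal self-contained sketch (not code from verifyEC itself) of the ByteBuffer.equals semantics behind that debugger view: equals compares only the remaining bytes, so two all-zero cells match, and a buffer whose position equals its limit has nothing left to compare at all.
{code:java}
import java.nio.ByteBuffer;

public class ZeroBufferEquality {
  public static void main(String[] args) {
    ByteBuffer storedParity = ByteBuffer.allocate(65536); // all zeros, pos=0 lim=65536
    ByteBuffer recomputed = ByteBuffer.allocate(65536);   // all zeros, pos=0 lim=65536
    System.out.println(storedParity.equals(recomputed));  // true: contents match

    // After a full pass, pos == lim, so remaining() == 0 and equals()
    // compares nothing at all; it is vacuously true.
    storedParity.position(storedParity.limit());
    recomputed.position(recomputed.limit());
    System.out.println(storedParity.equals(recomputed));  // true: nothing left to compare
  }
}{code}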

 

check orc file
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:360)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
        at org.apache.orc.tools.FileDump.main(FileDump.java:137)
        at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943)
        at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112)
        at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251)
        at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290)
        at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
        ... 6 more
{code}

  was:
When I validate a block that has been corrupted, why does it still appear normal after many checks?

 
{code:java}
hdfs  debug verifyEC  -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not 
available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
 

 

The ByteBuffer hb arrays show all zeros [0..]:
{code:java}
buffers = {ByteBuffer[5]@3270} 
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 

[jira] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-06-07 Thread ruiliang (Jira)


[ https://issues.apache.org/jira/browse/HDFS-15759 ]


ruiliang deleted comment on HDFS-15759:
-

was (Author: ruilaing):
When I validate a block that has been corrupted, why does it still appear normal after many checks?

The ByteBuffer hb arrays show all zeros [0..]:
{code:java}
buffers = {ByteBuffer[5]@3270} 
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]

buffers[this.dataBlkNum + ixx].equals(outputs[ixx]) == true ?

outputs = {ByteBuffer[2]@3271} 
 0 = {HeapByteBuffer@3455} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3459} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]{code}
Can this situation be judged as an anomaly?

 
{code:java}
hdfs  debug verifyEC  -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not 
available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
 

check orc file
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:360)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
        at org.apache.orc.tools.FileDump.main(FileDump.java:137)
        at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943)
        at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112)
        at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251)
        at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290)
        at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
        ... 6 more
{code}

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to 

[jira] [Created] (HDFS-17547) debug verifyEC check error

2024-06-07 Thread ruiliang (Jira)
ruiliang created HDFS-17547:
---

 Summary: debug verifyEC check error
 Key: HDFS-17547
 URL: https://issues.apache.org/jira/browse/HDFS-17547
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-common
Reporter: ruiliang


When I validate a block that has been corrupted, why does it still appear normal after many checks?

 
{code:java}
hdfs  debug verifyEC  -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not 
available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
 

 

The ByteBuffer hb arrays show all zeros [0..]:
{code:java}
buffers = {ByteBuffer[5]@3270} 
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]

buffers[this.dataBlkNum + ixx].equals(outputs[ixx]) == true ?

outputs = {ByteBuffer[2]@3271} 
 0 = {HeapByteBuffer@3455} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3459} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]{code}
Can this situation be judged as an anomaly?

 

check orc file
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:360)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
        at org.apache.orc.tools.FileDump.main(FileDump.java:137)
        at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943)
        at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112)
        at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251)
        at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290)
        at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
        ... 6 more
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-06-07 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853065#comment-17853065
 ] 

ruiliang edited comment on HDFS-15759 at 6/7/24 7:55 AM:
-

When I validate a block that has been corrupted, why does it still appear normal after many checks?

The ByteBuffer hb arrays show all zeros [0..]:
{code:java}
buffers = {ByteBuffer[5]@3270} 
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]

buffers[this.dataBlkNum + ixx].equals(outputs[ixx]) == true ?

outputs = {ByteBuffer[2]@3271} 
 0 = {HeapByteBuffer@3455} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3459} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]{code}
Can this situation be judged as an anomaly?

 
{code:java}
hdfs  debug verifyEC  -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not 
available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
 

check orc file
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:360)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
        at org.apache.orc.tools.FileDump.main(FileDump.java:137)
        at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943)
        at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112)
        at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251)
        at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290)
        at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
        ... 6 more
{code}


was (Author: ruilaing):
When I validate a block that has been corrupted, why does it still appear normal after many checks?

The ByteBuffer hb arrays show all zeros [0..]:
{code:java}
buffers = {ByteBuffer[5]@3270} 
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]{code}
Can this situation be 

[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-06-07 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853065#comment-17853065
 ] 

ruiliang edited comment on HDFS-15759 at 6/7/24 7:54 AM:
-

When I validate a block that has been corrupted, why does it still appear normal after many checks?

The ByteBuffer hb arrays show all zeros [0..]:
{code:java}
buffers = {ByteBuffer[5]@3270} 
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]{code}
Can this situation be judged as an anomaly?

 
{code:java}
hdfs  debug verifyEC  -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not 
available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
 

check orc file
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:360)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
        at org.apache.orc.tools.FileDump.main(FileDump.java:137)
        at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943)
        at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112)
        at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251)
        at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290)
        at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
        ... 6 more
{code}


was (Author: ruilaing):
When I validate a block that has been corrupted, why does it still appear normal after many checks?

The ByteBuffer hb arrays show all zeros [0..]:

!image-2024-06-07-15-52-26-294.png!

Can this situation be judged as an anomaly?

 
{code:java}
hdfs  debug verifyEC  -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not 
available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
 

check orc file
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:360)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
        at org.apache.orc.tools.FileDump.main(FileDump.java:137)
        at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
        at 

[jira] [Commented] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-06-07 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853065#comment-17853065
 ] 

ruiliang commented on HDFS-15759:
-

When I validate a block that has been corrupted, why does it still appear normal after many checks?

The ByteBuffer hb arrays show all zeros [0..]:

!image-2024-06-07-15-52-26-294.png!

Can this situation be judged as an anomaly?

 
{code:java}
hdfs  debug verifyEC  -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not 
available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
 

check orc file
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:360)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
        at org.apache.orc.tools.FileDump.main(FileDump.java:137)
        at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943)
        at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112)
        at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251)
        at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290)
        at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
        ... 6 more
{code}

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies correctness of outputs decoded from inputs as follows:
> 1. Decode an input with the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by NameNode.
> The next reconstruction will succeed if the condition triggered the failure 
> is gone.
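
To make the verification step concrete, here is a toy sketch using plain XOR parity instead of the real RS-6-3 coder (illustrative only, not the HDFS implementation): reconstruct a target, then decode one of the inputs back from the outputs and compare it with the original.
{code:java}
import java.util.Arrays;

public class XorVerifyDemo {
  // Toy single-parity "EC": p = d0 ^ d1 ^ d2.
  public static void main(String[] args) {
    byte[] d0 = {1, 2, 3}, d1 = {4, 5, 6}, d2 = {7, 8, 9};
    byte[] p = xor(xor(d0, d1), d2);          // original parity

    // Suppose d0 was lost and reconstructed from [d1, d2, p]:
    byte[] d0Rebuilt = xor(xor(d1, d2), p);

    // HDFS-15759-style check: decode a *different* input, d1, using the
    // reconstructed output, and compare it with the original d1.
    byte[] d1Check = xor(xor(d0Rebuilt, d2), p);
    System.out.println(Arrays.equals(d1, d1Check)); // true only if the rebuild was correct
  }

  static byte[] xor(byte[] a, byte[] b) {
    byte[] r = new byte[a.length];
    for (int i = 0; i < a.length; i++) r[i] = (byte) (a[i] ^ b[i]);
    return r;
  }
}{code}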



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?

2024-05-24 Thread ruiliang (Jira)
ruiliang created HDFS-17535:
---

 Summary: I have confirmed the EC corrupt file, can this corrupt 
file be restored?
 Key: HDFS-17535
 URL: https://issues.apache.org/jira/browse/HDFS-17535
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ec, hdfs
Affects Versions: 3.1.0
Reporter: ruiliang


I learned that EC does have a major file-corruption bug:
https://issues.apache.org/jira/browse/HDFS-15759


1: I have confirmed the EC corrupt file; can this corrupt file be restored?
Important data is affected, which is causing production data loss for us. Is there a way to recover it?
corrupt;/file;corrupt block groups \{blk_-xx} zeroParityBlockGroups \{blk_-xx[blk_-xx]}

2: https://github.com/apache/orc/issues/1939 I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?


HDFS version 3.1.0

Thank you



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2024-05-24 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849217#comment-17849217
 ] 

ruiliang commented on HDFS-15186:
-

I have confirmed the EC corrupt file; can this corrupt file be restored?
Important data is affected, which is causing production data loss for us. Is there a way to recover it?
corrupt;/file;corrupt block groups \{blk_-xx} zeroParityBlockGroups \{blk_-xx[blk_-xx]}
HDFS version 3.1.0

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: HDFS-15186.001.patch, HDFS-15186.002.patch, 
> HDFS-15186.003.patch, HDFS-15186.004.patch, HDFS-15186.005.patch
>
>
> # I can find some parity blocks whose content is all 0 when I decommission 
> DataNodes (more than 1) from a cluster, and the probability is quite high 
> (parts per thousand). This is a big problem: if we read data from the zero 
> parity block, or use the zero parity block to recover another block, we end 
> up using corrupt data without even knowing it.
> There are some cases below:
> B: Busy DataNode,
> D: Decommissioning DataNode,
> others are normal.
> 1. Group indices is [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices is [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 
> 7, 8(D)], the DN may receive a reconstruct-block command with liveIndices=[0, 
> 1, 2, 3, 4, 5, 7, 8] and a targets length of 2 (the field in the class 
> StripedReconstructionInfo).
> A targets length of 2 means the DataNode needs to recover 2 internal blocks 
> in the current code. But from the liveIndices only 1 missing block can be 
> found, so the method StripedWriter#initTargetIndices uses 0 as the default 
> block to recover, without checking whether index 0 is already among the 
> source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover indices [6, 0]. Index [0] appears in both the source indices and the 
> target indices in this case, and the returned target buffer for index [6] is 
> always 0 from the EC algorithm. So I think this is the EC algorithm's 
> problem, because it should be more fault tolerant. I tried to fix it there, 
> but it is too hard because there are too many cases; the second case above is 
> another example (using source indices [1, 2, 3, 4, 5, 7] to recover indices 
> [0, 6, 0]). So I changed my mind: invoke the EC algorithm with correct 
> parameters, which means removing the duplicate target index 0 in this case. 
> Finally, I fixed it in this way.
>  
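
As a rough, self-contained illustration of the fix idea quoted above (a toy sketch, not the actual HDFS-15186 patch): drop any target index that already appears among the source indices before invoking the decoder, so it is never asked to "recover" a block it was already given.
{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TargetIndexDedup {
  static int[] sanitizeTargets(int[] sources, int[] targets) {
    Set<Integer> sourceSet = new HashSet<>();
    for (int s : sources) sourceSet.add(s);
    // Keep only target indices that are genuinely missing.
    return Arrays.stream(targets).filter(t -> !sourceSet.contains(t)).toArray();
  }

  public static void main(String[] args) {
    int[] sources = {0, 1, 2, 3, 4, 5};
    int[] targets = {6, 0};               // index 0 duplicated from the sources
    System.out.println(Arrays.toString(sanitizeTargets(sources, targets))); // [6]
  }
}{code}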



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 4:52 AM:
--

[~weichiu]

Hello, our current production data also has this kind of EC storage corruption problem; the problem is described at
[https://github.com/apache/orc/issues/1939]
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

The current version of HDFS is 3.1.0.
Thank you!


was (Author: ruilaing):
Hello, our current production data also has this kind of EC storage corruption problem; the problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

The current version of HDFS is 3.1.0.
Thank you!

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies correctness of outputs decoded from inputs as follows:
> 1. Decoding an input with the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by NameNode.
> The next reconstruction will succeed if the condition triggered the failure 
> is gone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 3:53 AM:
--

Hello, our current production data also has this kind of EC storage corruption problem; the problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

The current version of HDFS is 3.1.0.
Thank you!


was (Author: ruilaing):
Hello, our current production data is also affected by this kind of EC storage 
corruption; the problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I avoid applying the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies the correctness of the outputs decoded from the inputs as follows:
> 1. Decode an input from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 3:52 AM:
--

Hello, our current production data is also affected by this kind of EC storage 
corruption; the problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I avoid applying the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!


was (Author: ruilaing):
Hello, our current production data is also affected by this kind of EC storage 
corruption; the problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I avoid applying the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies the correctness of the outputs decoded from the inputs as follows:
> 1. Decode an input from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 3:51 AM:
--

Hello, our current production data is also affected by this kind of EC storage 
corruption; the problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I avoid applying the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!


was (Author: ruilaing):
Hello, our current production data is also affected by this kind of EC storage 
corruption; the problem is described at
https://github.com/apache/orc/issues/1939
I would like to ask: if I cherry-pick your current code (GitHub pull request 
#2869), can I skip the patches that fix HDFS-14768, HDFS-15186 and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies the correctness of the outputs decoded from the inputs as follows:
> 1. Decode an input from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 3:50 AM:
--

Hello, our current production data is also affected by this kind of EC storage 
corruption; the problem is described at
https://github.com/apache/orc/issues/1939
I would like to ask: if I cherry-pick your current code (GitHub pull request 
#2869), can I skip the patches that fix HDFS-14768, HDFS-15186 and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!


was (Author: ruilaing):
Hello, our current online data also exhibits this kind of EC storage corruption; 
the problem is described at 
https://github.com/apache/orc/issues/1939
I would like to ask: if I cherry-pick your current code (GitHub pull request 
#2869), can I skip the patches that fix HDFS-14768, HDFS-15186 and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies the correctness of the outputs decoded from the inputs as follows:
> 1. Decode an input from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang commented on HDFS-15759:
-

Hello, our current online data also exhibits this kind of EC storage corruption; 
the problem is described at 
https://github.com/apache/orc/issues/1939
I would like to ask: if I cherry-pick your current code (GitHub pull request 
#2869), can I skip the patches that fix HDFS-14768, HDFS-15186 and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies the correctness of the outputs decoded from the inputs as follows:
> 1. Decode an input from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-17407) Exception during image upload

2024-03-07 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824329#comment-17824329
 ] 

ruiliang edited comment on HDFS-17407 at 3/7/24 9:29 AM:
-

After analyzing the log and source code, the cause is that the two SbNNs 
initiated a checkpoint at the same time. When the later one validated the file 
stream, it found that the file had already been updated, and threw an exception. 
Should this really be reported as an exception?
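
One possible mitigation while on 3.1.0 (an assumption on my side, not a verified 
fix): since each NameNode reads its own hdfs-site.xml, the checkpoint trigger 
can be staggered per standby so the two SbNNs rarely start a checkpoint in the 
same window. The values below are illustrative only.
{code:xml}
<!-- hdfs-site.xml on standby NN 1 (illustrative value; 3600s is the default) -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>

<!-- hdfs-site.xml on standby NN 2: offset the period so the two standbys
     rarely hit their time-based checkpoint trigger together (illustrative) -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>4500</value>
</property>
{code}
Note the transaction-count trigger (dfs.namenode.checkpoint.txns) can still fire 
on both standbys at once, so this reduces, but does not eliminate, the collision.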

SbNN 1 log

 
{code:java}
root@cluster06-yynn1:/data/logs/hadoop/hdfs# grep 57258734311 
hadoop-hdfs-namenode-cluster06-nn1.xx.com.log 
2024-03-07 16:48:00,061 INFO  namenode.FSImage (FSImage.java:loadEdits(887)) - 
Reading 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@4afc4056 
expecting start txid #57258734311
2024-03-07 16:48:00,061 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(158)) - Start loading edits file 
http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 maxTxnsToRead = 9223372036854775807
2024-03-07 16:48:00,061 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:00,061 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:02,592 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(162)) - Edits file 
http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 of size 35380849 edits # 214398 loaded in 2 seconds {code}
SbNN 2 log

 
{code:java}
root@cluster06-yynn3:/data/logs/hadoop/hdfs# grep 57258734311 
hadoop-hdfs-namenode-cluster06-nn3.xx.com.log
2024-03-07 16:48:32,536 INFO  namenode.FSImage (FSImage.java:loadEdits(887)) - 
Reading 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6d0659cd 
expecting start txid #57258734311
2024-03-07 16:48:32,536 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(158)) - Start loading edits file 
http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 maxTxnsToRead = 9223372036854775807
2024-03-07 16:48:32,536 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:32,536 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:35,634 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(162)) - Edits file 
http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 of size 35380849 edits # 214398 loaded in 3 seconds 
...

2024-03-07 16:48:32,547 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(394)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_57258734310, fileSize: 
4811881207. Sent total: 2228224 bytes. Size of last segment 

[jira] [Comment Edited] (HDFS-17407) Exception during image upload

2024-03-07 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824329#comment-17824329
 ] 

ruiliang edited comment on HDFS-17407 at 3/7/24 9:26 AM:
-

After analyzing the log and source code, the cause is that the two SbNNs 
initiated a checkpoint at the same time. When the later one validated the file 
stream, it found that the file had already been updated, and threw an exception. 
Should this really be reported as an exception?

SbNN 1 log

 
{code:java}
root@cluster06-yynn1:/data/logs/hadoop/hdfs# grep 57258734311 
hadoop-hdfs-namenode-cluster06-yynn1.xx.com.log 
2024-03-07 16:48:00,061 INFO  namenode.FSImage (FSImage.java:loadEdits(887)) - 
Reading 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@4afc4056 
expecting start txid #57258734311
2024-03-07 16:48:00,061 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(158)) - Start loading edits file 
http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 maxTxnsToRead = 9223372036854775807
2024-03-07 16:48:00,061 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:00,061 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:02,592 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(162)) - Edits file 
http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 of size 35380849 edits # 214398 loaded in 2 seconds {code}
SbNN 2 log

 
{code:java}
root@cluster06-yynn3:/data/logs/hadoop/hdfs# grep 57258734311 
hadoop-hdfs-namenode-cluster06-yynn3.xx.com.log
2024-03-07 16:48:32,536 INFO  namenode.FSImage (FSImage.java:loadEdits(887)) - 
Reading 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6d0659cd 
expecting start txid #57258734311
2024-03-07 16:48:32,536 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(158)) - Start loading edits file 
http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 maxTxnsToRead = 9223372036854775807
2024-03-07 16:48:32,536 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:32,536 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:35,634 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(162)) - Edits file 
http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 of size 35380849 edits # 214398 loaded in 3 seconds 
...

2024-03-07 

[jira] [Updated] (HDFS-17407) Exception during image upload

2024-03-07 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-17407:

Issue Type: Improvement  (was: Bug)

> Exception during image upload
> -
>
> Key: HDFS-17407
> URL: https://issues.apache.org/jira/browse/HDFS-17407
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.1.0
> Environment: hadoop 3.1.0 
> linux:ubuntu 16.04
> ambari-hdp:3.1.1
>Reporter: ruiliang
>Priority: Major
>
> After I added a third HDFS NameNode, the service ran fine. However, the two 
> Standby NameNodes' service logs always show exceptions during image upload. 
> Meanwhile, I observe that the image file on the primary node is being updated 
> normally, which indicates that a standby node has merged the image file 
> and uploaded it to the primary node. But I don't understand why the two 
> Standby NameNodes keep emitting such exception logs. Is there a potential risk?
>  
> namenode log 
> {code:java}
> 2024-03-01 15:31:46,162 INFO  namenode.TransferFsImage 
> (TransferFsImage.java:copyFileToStream(394)) - Sending fileName: 
> /data/hadoop/hdfs/namenode/current/fsimage_55689095810, fileSize: 
> 4626167848. Sent total: 1703936 bytes. Size of last segment intended to send: 
> 131072 bytes.
> java.io.IOException: Error writing request body to server
>         at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>         at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>         at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
>         at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
>         at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
>         at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:236)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:231)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2024-03-01 15:31:46,630 INFO  blockmanagement.BlockManager 
> (BlockManager.java:enqueue(4923)) - Block report queue is full
> 2024-03-01 15:31:46,664 ERROR ha.StandbyCheckpointer 
> (StandbyCheckpointer.java:doWork(452)) - Exception in doCheckpoint
> java.io.IOException: Exception during image upload
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:257)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1500(StandbyCheckpointer.java:62)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:432)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:331)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:351)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:360)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1710)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:347)
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: 
> Error writing request body to server
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:250)
>         ... 9 more
> Caused by: java.io.IOException: Error writing request body to server
>         at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>         at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>         at 
> 

[jira] [Commented] (HDFS-17407) Exception during image upload

2024-03-07 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824329#comment-17824329
 ] 

ruiliang commented on HDFS-17407:
-

After analyzing the log and source code, the cause is that the two SbNNs 
initiated a checkpoint at the same time. When the later one validated the file 
stream, it found that the file had already been updated, and threw an exception. 
Should this really be reported as an exception?

SbNN 1 log

 
{code:java}
root@fs-hiido-yycluster06-yynn1:/data/logs/hadoop/hdfs# grep 57258734311 
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com.log 
2024-03-07 16:48:00,061 INFO  namenode.FSImage (FSImage.java:loadEdits(887)) - 
Reading 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@4afc4056 
expecting start txid #57258734311
2024-03-07 16:48:00,061 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(158)) - Start loading edits file 
http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 maxTxnsToRead = 9223372036854775807
2024-03-07 16:48:00,061 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:00,061 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:02,592 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(162)) - Edits file 
http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 of size 35380849 edits # 214398 loaded in 2 seconds {code}
SbNN 2 log

 
{code:java}
root@fs-hiido-yycluster06-yynn3:/data/logs/hadoop/hdfs# grep 57258734311 
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.int.yy.com.log
2024-03-07 16:48:32,536 INFO  namenode.FSImage (FSImage.java:loadEdits(887)) - 
Reading 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6d0659cd 
expecting start txid #57258734311
2024-03-07 16:48:32,536 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(158)) - Start loading edits file 
http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 maxTxnsToRead = 9223372036854775807
2024-03-07 16:48:32,536 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:32,536 INFO  namenode.RedundantEditLogInputStream 
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true'
 to transaction ID 57258734311
2024-03-07 16:48:35,634 INFO  namenode.FSImage 
(FSEditLogLoader.java:loadFSEdits(162)) - Edits file 
http://fs-nn-party-65-191.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true,
 
http://fs-nn-party-65-190.hiido.host.yydevops.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true
 of size 35380849 edits # 214398 loaded in 3 

[jira] [Resolved] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space

2024-03-06 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang resolved HDFS-16799.
-
Resolution: Cannot Reproduce

> The dn space size is not consistent, and Balancer can not work, resulting in 
> a very unbalanced space
> 
>
> Key: HDFS-16799
> URL: https://issues.apache.org/jira/browse/HDFS-16799
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
>  
> {code:java}
> echo 'A DFS Used 99.8% to ip' > sorucehost  
> hdfs --debug  balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f 
> sorucehost  
> 
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.243:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.247:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-15-10/10.12.65.214:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-02-08/10.12.14.8:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.154:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-04/10.12.65.218:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.143:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-05/10.12.12.200:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.217:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.142:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.246:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.219:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.147:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-15-10/10.12.65.186:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.153:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-07/10.12.19.23:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-04-14/10.12.65.119:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.131:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-04/10.12.12.210:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-11/10.12.14.168:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.245:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-02/10.12.17.26:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.241:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.152:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.249:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-07-14/10.12.64.71:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-03/10.12.17.35:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.195:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.242:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.248:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.240:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-15-12/10.12.65.196:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.150:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.222:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.145:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.244:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-07/10.12.19.22:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.221:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.136:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.129:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-15/10.12.15.163:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-07-14/10.12.64.72:1019
> 

[jira] [Updated] (HDFS-17407) Exception during image upload

2024-03-06 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-17407:

Description: 
After I added a third HDFS NameNode, the service ran fine. However, the two 
Standby NameNodes' service logs always show exceptions during image upload. 
Meanwhile, I observe that the image file on the primary node is being updated 
normally, which indicates that a standby node has merged the image file and 
uploaded it to the primary node. But I don't understand why the two Standby 
NameNodes keep emitting such exception logs. Is there a potential risk?

 

namenode log 
{code:java}
2024-03-01 15:31:46,162 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(394)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_55689095810, fileSize: 
4626167848. Sent total: 1703936 bytes. Size of last segment intended to send: 
131072 bytes.
java.io.IOException: Error writing request body to server
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:236)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:231)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2024-03-01 15:31:46,630 INFO  blockmanagement.BlockManager 
(BlockManager.java:enqueue(4923)) - Block report queue is full
2024-03-01 15:31:46,664 ERROR ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(452)) - Exception in doCheckpoint
java.io.IOException: Exception during image upload
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:257)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1500(StandbyCheckpointer.java:62)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:432)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:331)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:351)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1710)
        at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:347)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error 
writing request body to server
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:250)
        ... 9 more
Caused by: java.io.IOException: Error writing request body to server
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:236)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:231)
        at 

[jira] [Updated] (HDFS-17407) Exception during image upload

2024-03-06 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-17407:

Description: 
After I added a third HDFS NameNode, the service ran fine. However, the two 
Standby NameNodes' service logs always show exceptions during image upload. 
Meanwhile, I observe that the image file on the primary node is being updated 
normally, which indicates that a standby node has merged the image file and 
uploaded it to the primary node. But I don't understand why the two Standby 
NameNodes keep emitting such exception logs. Is there a potential risk?

 

 
{code:java}
2024-03-01 15:31:46,162 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(394)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_55689095810, fileSize: 
4626167848. Sent total: 1703936 bytes. Size of last segment intended to send: 
131072 bytes.
java.io.IOException: Error writing request body to server
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:236)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:231)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2024-03-01 15:31:46,630 INFO  blockmanagement.BlockManager 
(BlockManager.java:enqueue(4923)) - Block report queue is full
2024-03-01 15:31:46,664 ERROR ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(452)) - Exception in doCheckpoint
java.io.IOException: Exception during image upload
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:257)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1500(StandbyCheckpointer.java:62)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:432)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:331)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:351)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1710)
        at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:347)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error 
writing request body to server
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:250)
        ... 9 more
Caused by: java.io.IOException: Error writing request body to server
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:236)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:231)
        at 

[jira] [Created] (HDFS-17407) Exception during image upload

2024-02-29 Thread ruiliang (Jira)
ruiliang created HDFS-17407:
---

 Summary: Exception during image upload
 Key: HDFS-17407
 URL: https://issues.apache.org/jira/browse/HDFS-17407
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.1.0
 Environment: hadoop 3.1.0 

linux:ubuntu 16.04

ambari-hdp:3.1.1
Reporter: ruiliang


After I added a third HDFS NameNode, the service ran fine. However, the two 
Standby NameNodes' service logs always show exceptions during image upload. 
Meanwhile, I observe that the image file on the primary node is being updated 
normally, which indicates that a standby node has merged the image file and 
uploaded it to the primary node. But I don't understand why the two Standby 
NameNodes keep emitting such exception logs. Is there a potential risk?

 

 
{code:java}
2024-03-01 15:31:46,162 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(394)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_55689095810, fileSize: 
4626167848. Sent total: 1703936 bytes. Size of last segment intended to send: 
131072 bytes.
java.io.IOException: Error writing request body to server
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:236)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:231)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2024-03-01 15:31:46,630 INFO  blockmanagement.BlockManager 
(BlockManager.java:enqueue(4923)) - Block report queue is full
2024-03-01 15:31:46,664 ERROR ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(452)) - Exception in doCheckpoint
java.io.IOException: Exception during image upload
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:257)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1500(StandbyCheckpointer.java:62)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:432)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:331)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:351)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1710)
        at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:347)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error 
writing request body to server
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:250)
        ... 9 more
Caused by: java.io.IOException: Error writing request body to server
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
        at 

[jira] [Commented] (HDFS-7343) HDFS smart storage management

2023-02-20 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691365#comment-17691365
 ] 

ruiliang commented on HDFS-7343:


https://github.com/Intel-bigdata/SSM
This repository has been archived by the owner on Jan 4, 2023. It is now 
read-only.
 
Is this project still active?

Why was development discontinued?

Or has something else taken its place?

Thank you
 

> HDFS smart storage management
> -
>
> Key: HDFS-7343
> URL: https://issues.apache.org/jira/browse/HDFS-7343
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kai Zheng
>Assignee: Wei Zhou
>Priority: Major
> Attachments: HDFS-Smart-Storage-Management-update.pdf, 
> HDFS-Smart-Storage-Management.pdf, 
> HDFSSmartStorageManagement-General-20170315.pdf, 
> HDFSSmartStorageManagement-Phase1-20170315.pdf, access_count_tables.jpg, 
> move.jpg, tables_in_ssm.xlsx
>
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and 
> flexible storage policy engine considering file attributes, metadata, data 
> temperature, storage type, EC codec, available hardware capabilities, 
> user/application preference and etc.
> Modified the title for re-purpose.
> We'd extend this effort some bit and aim to work on a comprehensive solution 
> to provide smart storage management service in order for convenient, 
> intelligent and effective utilizing of erasure coding or replicas, HDFS cache 
> facility, HSM offering, and all kinds of tools (balancer, mover, disk 
> balancer and so on) in a large cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space

2023-02-15 Thread ruiliang (Jira)


[ https://issues.apache.org/jira/browse/HDFS-16799 ]


ruiliang deleted comment on HDFS-16799:
-

was (Author: ruilaing):
ok

> The dn space size is not consistent, and Balancer can not work, resulting in 
> a very unbalanced space
> 
>
> Key: HDFS-16799
> URL: https://issues.apache.org/jira/browse/HDFS-16799
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
>  
> {code:java}
> echo 'A DFS Used 99.8% to ip' > sorucehost  
> hdfs --debug  balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f 
> sorucehost  
> 
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.243:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.247:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-15-10/10.12.65.214:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-02-08/10.12.14.8:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.154:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-04/10.12.65.218:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.143:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-05/10.12.12.200:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.217:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.142:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.246:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.219:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.147:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-15-10/10.12.65.186:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.153:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-07/10.12.19.23:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-04-14/10.12.65.119:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.131:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-04/10.12.12.210:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-11/10.12.14.168:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.245:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-02/10.12.17.26:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.241:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.152:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.249:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-07-14/10.12.64.71:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-03/10.12.17.35:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.195:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.242:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.248:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.240:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-15-12/10.12.65.196:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.150:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.222:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.145:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.244:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-07/10.12.19.22:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.221:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.136:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.129:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-15/10.12.15.163:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-07-14/10.12.64.72:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a 

[jira] [Commented] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space

2023-02-15 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688914#comment-17688914
 ] 

ruiliang commented on HDFS-16799:
-

ok

> The dn space size is not consistent, and Balancer can not work, resulting in 
> a very unbalanced space
> 
>
> Key: HDFS-16799
> URL: https://issues.apache.org/jira/browse/HDFS-16799
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
>  
> {code:java}
> echo 'A DFS Used 99.8% to ip' > sorucehost  
> hdfs --debug  balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f 
> sorucehost  
> 
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.243:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.247:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-15-10/10.12.65.214:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-02-08/10.12.14.8:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.154:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-04/10.12.65.218:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.143:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-05/10.12.12.200:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.217:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.142:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.246:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.219:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.147:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-15-10/10.12.65.186:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.153:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-07/10.12.19.23:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-04-14/10.12.65.119:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.131:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-04/10.12.12.210:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-11/10.12.14.168:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.245:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-02/10.12.17.26:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.241:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.152:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.249:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-07-14/10.12.64.71:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-03/10.12.17.35:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.195:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.242:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.248:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.240:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-15-12/10.12.65.196:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.150:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.222:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.145:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-01-08/10.12.65.244:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-03-07/10.12.19.22:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.221:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.136:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.129:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-15/10.12.15.163:1019
> 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> 

[jira] [Resolved] (HDFS-16806) ec data balancer block blk_id The index error ,Data cannot be moved

2022-10-20 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang resolved HDFS-16806.
-
Hadoop Flags: Reviewed
  Resolution: Fixed

> ec data balancer block blk_id The index error ,Data cannot be moved
> ---
>
> Key: HDFS-16806
> URL: https://issues.apache.org/jira/browse/HDFS-16806
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Critical
> Attachments: image-2022-10-20-11-32-35-833.png
>
>
> ec data balancer block blk_id The index error ,Data cannot be moved
> dn->10.12.15.149 use disk 100%
>  
> {code:java}
> echo 10.12.15.149>sorucehost
> balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f sorucehost   
> 2>>~/balancer.log &  {code}
>  
> The datanode logs contain a lot of the following output:
> {code:java}
> datanode logs
> ...
> 2022-10-19 14:43:02,031 ERROR datanode.DataNode (DataXceiver.java:run(321)) - 
> fs-hiido-dn-12-15-149.xx.com:1019:DataXceiver error processing COPY_BLOCK 
> operation  src: /10.12.65.216:58214 dst: /10.12.15.149:1019
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not 
> found for 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:492)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:256)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:1089)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:291)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:113)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
>         at java.lang.Thread.run(Thread.java:748)
> ...    
>     
> hdfs fsck -fs hdfs://xxcluster06 -blockId blk_-9223372036799576592 
> Connecting to namenode via 
> http://fs-hiido-xxcluster06-yynn2.xx.com:50070/fsck?ugi=hdfs=blk_-9223372036799576592+=%2F
> FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 at Wed Oct 19 
> 14:47:15 CST 2022
> Block Id: blk_-9223372036799576592
> Block belongs to: 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
> No. of Expected Replica: 5
> No. of live Replica: 5
> No. of excess Replica: 0
> No. of stale Replica: 5
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 0
> Block replica on datanode/rack: fs-hiido-dn-12-66-4.xx.com/4F08-01-09 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-65-244.xx.com/4F08-01-08 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-15-149.xx.com/4F08-05-13 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-65-218.xx.com/4F08-12-04 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-17-35.xx.com/4F08-03-03 is 
> HEALTHY
> hdfs fsck -fs hdfs://xxcluster06 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  -files -blocks -locations
> Connecting to namenode via 
> http://xx.com:50070/fsck?ugi=hdfs=1=1=1=%2Fhive_warehouse%2Fwarehouse_old_snapshots%2Fyy_mbsdkevent_original%2Fdt%3D20210505%2Fpost_202105052129_33.log.gz
> FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 for path 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  at Wed Oct 19 14:48:42 CST 2022
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  500582412 bytes, erasure-coded: policy=RS-3-2-1024k, 1 block(s):  OK
> 0. BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617 
> len=500582412 Live_repl=5  
> [blk_-9223372036799576592:DatanodeInfoWithStorage[10.12.17.35:1019,DS-3ccebf8d-5f05-45b5-ac7f-96d1cfb48608,DISK],
>  
> blk_-9223372036799576591:DatanodeInfoWithStorage[10.12.65.218:1019,DS-4f8e3114-7566-4cf1-ad5a-e454c8ea8805,DISK],
>  
> blk_-9223372036799576590:DatanodeInfoWithStorage[10.12.15.149:1019,DS-1dd55c27-8f47-46a6-935b-1d9024ca9188,DISK],
>  
> blk_-9223372036799576589:DatanodeInfoWithStorage[10.12.65.244:1019,DS-a9ffd747-c427-4aaa-8559-04cded7d9d5f,DISK],
>  
> blk_-9223372036799576588:DatanodeInfoWithStorage[10.12.66.4:1019,DS-d88f94db-6db1-4753-a652-780d7cd7f081,DISK]]
> Status: HEALTHY
>  Number of data-nodes:  62
>  Number of racks:               19
>  Total dirs:                    0
>  Total symlinks:                0
> Replicated Blocks:
>  Total size:    0 B
>  Total files:   0
>  Total blocks (validated):      0
>  Minimally replicated blocks:   0
>  Over-replicated blocks:        0
>  Under-replicated 

[jira] [Commented] (HDFS-16806) ec data balancer block blk_id The index error ,Data cannot be moved

2022-10-19 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620704#comment-17620704
 ] 

ruiliang commented on HDFS-16806:
-

After backporting HDFS-16333, I updated hadoop-hdfs.jar only on the balancer 
client service, and the problem is solved. The figure below compares the 
behavior before and after the update.

!image-2022-10-20-11-32-35-833.png!
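
For anyone reproducing this fix, the change amounts to rebuilding hadoop-hdfs with the HDFS-16333 patch and swapping the jar on the host that runs the Balancer. A minimal sketch, assuming a 3.1.0 source tree and illustrative file paths (the patch file name and install path are not taken from this cluster):

{code:bash}
# Rebuild only the hadoop-hdfs module with the HDFS-16333 patch applied
cd hadoop-rel-release-3.1.0
git apply HDFS-16333.patch
mvn -pl hadoop-hdfs-project/hadoop-hdfs -am -DskipTests clean package

# Replace the jar on the balancer client host only (destination path is illustrative)
cp hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-3.1.0.jar \
   /opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-3.1.0.jar

# Re-run the balancer against the full DataNode and check that blocks now move
echo 10.12.15.149 > sorucehost
hdfs balancer -fs hdfs://xxcluster06 -threshold 10 -source -f sorucehost 2>>~/balancer.log &
{code}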

> ec data balancer block blk_id The index error ,Data cannot be moved
> ---
>
> Key: HDFS-16806
> URL: https://issues.apache.org/jira/browse/HDFS-16806
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Critical
> Attachments: image-2022-10-20-11-32-35-833.png
>
>
> ec data balancer block blk_id The index error ,Data cannot be moved
> dn->10.12.15.149 use disk 100%
>  
> {code:java}
> echo 10.12.15.149>sorucehost
> balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f sorucehost   
> 2>>~/balancer.log &  {code}
>  
> The datanode logs contain a lot of the following output:
> {code:java}
> datanode logs
> ...
> 2022-10-19 14:43:02,031 ERROR datanode.DataNode (DataXceiver.java:run(321)) - 
> fs-hiido-dn-12-15-149.xx.com:1019:DataXceiver error processing COPY_BLOCK 
> operation  src: /10.12.65.216:58214 dst: /10.12.15.149:1019
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not 
> found for 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:492)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:256)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:1089)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:291)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:113)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
>         at java.lang.Thread.run(Thread.java:748)
> ...    
>     
> hdfs fsck -fs hdfs://xxcluster06 -blockId blk_-9223372036799576592 
> Connecting to namenode via 
> http://fs-hiido-xxcluster06-yynn2.xx.com:50070/fsck?ugi=hdfs=blk_-9223372036799576592+=%2F
> FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 at Wed Oct 19 
> 14:47:15 CST 2022
> Block Id: blk_-9223372036799576592
> Block belongs to: 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
> No. of Expected Replica: 5
> No. of live Replica: 5
> No. of excess Replica: 0
> No. of stale Replica: 5
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 0
> Block replica on datanode/rack: fs-hiido-dn-12-66-4.xx.com/4F08-01-09 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-65-244.xx.com/4F08-01-08 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-15-149.xx.com/4F08-05-13 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-65-218.xx.com/4F08-12-04 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-17-35.xx.com/4F08-03-03 is 
> HEALTHY
> hdfs fsck -fs hdfs://xxcluster06 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  -files -blocks -locations
> Connecting to namenode via 
> http://xx.com:50070/fsck?ugi=hdfs=1=1=1=%2Fhive_warehouse%2Fwarehouse_old_snapshots%2Fyy_mbsdkevent_original%2Fdt%3D20210505%2Fpost_202105052129_33.log.gz
> FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 for path 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  at Wed Oct 19 14:48:42 CST 2022
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  500582412 bytes, erasure-coded: policy=RS-3-2-1024k, 1 block(s):  OK
> 0. BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617 
> len=500582412 Live_repl=5  
> [blk_-9223372036799576592:DatanodeInfoWithStorage[10.12.17.35:1019,DS-3ccebf8d-5f05-45b5-ac7f-96d1cfb48608,DISK],
>  
> blk_-9223372036799576591:DatanodeInfoWithStorage[10.12.65.218:1019,DS-4f8e3114-7566-4cf1-ad5a-e454c8ea8805,DISK],
>  
> blk_-9223372036799576590:DatanodeInfoWithStorage[10.12.15.149:1019,DS-1dd55c27-8f47-46a6-935b-1d9024ca9188,DISK],
>  
> blk_-9223372036799576589:DatanodeInfoWithStorage[10.12.65.244:1019,DS-a9ffd747-c427-4aaa-8559-04cded7d9d5f,DISK],
>  
> blk_-9223372036799576588:DatanodeInfoWithStorage[10.12.66.4:1019,DS-d88f94db-6db1-4753-a652-780d7cd7f081,DISK]]
> Status: HEALTHY
>  Number of data-nodes:  62
>  Number of racks:               19
>  Total dirs:                    0
>  Total symlinks:   

[jira] [Updated] (HDFS-16806) ec data balancer block blk_id The index error ,Data cannot be moved

2022-10-19 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-16806:

Attachment: image-2022-10-20-11-32-35-833.png

> ec data balancer block blk_id The index error ,Data cannot be moved
> ---
>
> Key: HDFS-16806
> URL: https://issues.apache.org/jira/browse/HDFS-16806
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Critical
> Attachments: image-2022-10-20-11-32-35-833.png
>
>
> ec data balancer block blk_id The index error ,Data cannot be moved
> dn->10.12.15.149 use disk 100%
>  
> {code:java}
> echo 10.12.15.149>sorucehost
> balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f sorucehost   
> 2>>~/balancer.log &  {code}
>  
> The datanode logs contain a lot of the following output:
> {code:java}
> datanode logs
> ...
> 2022-10-19 14:43:02,031 ERROR datanode.DataNode (DataXceiver.java:run(321)) - 
> fs-hiido-dn-12-15-149.xx.com:1019:DataXceiver error processing COPY_BLOCK 
> operation  src: /10.12.65.216:58214 dst: /10.12.15.149:1019
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not 
> found for 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:492)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:256)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:1089)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:291)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:113)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
>         at java.lang.Thread.run(Thread.java:748)
> ...    
>     
> hdfs fsck -fs hdfs://xxcluster06 -blockId blk_-9223372036799576592 
> Connecting to namenode via 
> http://fs-hiido-xxcluster06-yynn2.xx.com:50070/fsck?ugi=hdfs=blk_-9223372036799576592+=%2F
> FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 at Wed Oct 19 
> 14:47:15 CST 2022
> Block Id: blk_-9223372036799576592
> Block belongs to: 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
> No. of Expected Replica: 5
> No. of live Replica: 5
> No. of excess Replica: 0
> No. of stale Replica: 5
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 0
> Block replica on datanode/rack: fs-hiido-dn-12-66-4.xx.com/4F08-01-09 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-65-244.xx.com/4F08-01-08 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-15-149.xx.com/4F08-05-13 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-65-218.xx.com/4F08-12-04 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-17-35.xx.com/4F08-03-03 is 
> HEALTHY
> hdfs fsck -fs hdfs://xxcluster06 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  -files -blocks -locations
> Connecting to namenode via 
> http://xx.com:50070/fsck?ugi=hdfs=1=1=1=%2Fhive_warehouse%2Fwarehouse_old_snapshots%2Fyy_mbsdkevent_original%2Fdt%3D20210505%2Fpost_202105052129_33.log.gz
> FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 for path 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  at Wed Oct 19 14:48:42 CST 2022
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  500582412 bytes, erasure-coded: policy=RS-3-2-1024k, 1 block(s):  OK
> 0. BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617 
> len=500582412 Live_repl=5  
> [blk_-9223372036799576592:DatanodeInfoWithStorage[10.12.17.35:1019,DS-3ccebf8d-5f05-45b5-ac7f-96d1cfb48608,DISK],
>  
> blk_-9223372036799576591:DatanodeInfoWithStorage[10.12.65.218:1019,DS-4f8e3114-7566-4cf1-ad5a-e454c8ea8805,DISK],
>  
> blk_-9223372036799576590:DatanodeInfoWithStorage[10.12.15.149:1019,DS-1dd55c27-8f47-46a6-935b-1d9024ca9188,DISK],
>  
> blk_-9223372036799576589:DatanodeInfoWithStorage[10.12.65.244:1019,DS-a9ffd747-c427-4aaa-8559-04cded7d9d5f,DISK],
>  
> blk_-9223372036799576588:DatanodeInfoWithStorage[10.12.66.4:1019,DS-d88f94db-6db1-4753-a652-780d7cd7f081,DISK]]
> Status: HEALTHY
>  Number of data-nodes:  62
>  Number of racks:               19
>  Total dirs:                    0
>  Total symlinks:                0
> Replicated Blocks:
>  Total size:    0 B
>  Total files:   0
>  Total blocks (validated):      0
>  Minimally replicated blocks:   0
>  Over-replicated blocks:        0
>  Under-replicated 

[jira] [Commented] (HDFS-16806) ec data balancer block blk_id The index error ,Data cannot be moved

2022-10-19 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620065#comment-17620065
 ] 

ruiliang commented on HDFS-16806:
-

https://issues.apache.org/jira/browse/HDFS-16333

Is that the same issue?

Is it enough to apply the patch on the balancer client?

Or does it also need to be pulled onto the NameNode server?
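
Since the Balancer runs as a client-side tool, one way to see where the patched jar matters is to check which hadoop-hdfs jar the client host actually loads. A small sketch using standard commands; the jar path is an illustration, not this cluster's layout:

{code:bash}
# List every hadoop-hdfs jar on the client classpath
hadoop classpath --glob | tr ':' '\n' | grep 'hadoop-hdfs'

# Confirm the Balancer's Dispatcher class ships in the jar that was replaced
unzip -l /opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-3.1.0.jar | grep 'server/balancer/Dispatcher'
{code}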

> ec data balancer block blk_id The index error ,Data cannot be moved
> ---
>
> Key: HDFS-16806
> URL: https://issues.apache.org/jira/browse/HDFS-16806
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Blocker
>
> ec data balancer block blk_id The index error ,Data cannot be moved
> dn->10.12.15.149 use disk 100%
>  
> {code:java}
> echo 10.12.15.149>sorucehost
> balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f sorucehost   
> 2>>~/balancer.log &  {code}
>  
> The datanode logs contain a lot of the following output:
> {code:java}
> datanode logs
> ...
> 2022-10-19 14:43:02,031 ERROR datanode.DataNode (DataXceiver.java:run(321)) - 
> fs-hiido-dn-12-15-149.xx.com:1019:DataXceiver error processing COPY_BLOCK 
> operation  src: /10.12.65.216:58214 dst: /10.12.15.149:1019
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not 
> found for 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:492)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:256)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:1089)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:291)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:113)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
>         at java.lang.Thread.run(Thread.java:748)
> ...    
>     
> hdfs fsck -fs hdfs://xxcluster06 -blockId blk_-9223372036799576592 
> Connecting to namenode via 
> http://fs-hiido-xxcluster06-yynn2.xx.com:50070/fsck?ugi=hdfs=blk_-9223372036799576592+=%2F
> FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 at Wed Oct 19 
> 14:47:15 CST 2022
> Block Id: blk_-9223372036799576592
> Block belongs to: 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
> No. of Expected Replica: 5
> No. of live Replica: 5
> No. of excess Replica: 0
> No. of stale Replica: 5
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 0
> Block replica on datanode/rack: fs-hiido-dn-12-66-4.xx.com/4F08-01-09 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-65-244.xx.com/4F08-01-08 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-15-149.xx.com/4F08-05-13 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-65-218.xx.com/4F08-12-04 is 
> HEALTHY
> Block replica on datanode/rack: fs-hiido-dn-12-17-35.xx.com/4F08-03-03 is 
> HEALTHY
> hdfs fsck -fs hdfs://xxcluster06 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  -files -blocks -locations
> Connecting to namenode via 
> http://xx.com:50070/fsck?ugi=hdfs=1=1=1=%2Fhive_warehouse%2Fwarehouse_old_snapshots%2Fyy_mbsdkevent_original%2Fdt%3D20210505%2Fpost_202105052129_33.log.gz
> FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 for path 
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  at Wed Oct 19 14:48:42 CST 2022
> /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
>  500582412 bytes, erasure-coded: policy=RS-3-2-1024k, 1 block(s):  OK
> 0. BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617 
> len=500582412 Live_repl=5  
> [blk_-9223372036799576592:DatanodeInfoWithStorage[10.12.17.35:1019,DS-3ccebf8d-5f05-45b5-ac7f-96d1cfb48608,DISK],
>  
> blk_-9223372036799576591:DatanodeInfoWithStorage[10.12.65.218:1019,DS-4f8e3114-7566-4cf1-ad5a-e454c8ea8805,DISK],
>  
> blk_-9223372036799576590:DatanodeInfoWithStorage[10.12.15.149:1019,DS-1dd55c27-8f47-46a6-935b-1d9024ca9188,DISK],
>  
> blk_-9223372036799576589:DatanodeInfoWithStorage[10.12.65.244:1019,DS-a9ffd747-c427-4aaa-8559-04cded7d9d5f,DISK],
>  
> blk_-9223372036799576588:DatanodeInfoWithStorage[10.12.66.4:1019,DS-d88f94db-6db1-4753-a652-780d7cd7f081,DISK]]
> Status: HEALTHY
>  Number of data-nodes:  62
>  Number of racks:               19
>  Total dirs:                    0
>  Total symlinks:                0
> Replicated Blocks:
>  Total size:    0 B
>  Total files:   0
>  Total blocks (validated):      0
>  

[jira] [Updated] (HDFS-16806) ec data balancer block blk_id The index error ,Data cannot be moved

2022-10-19 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-16806:

Description: 
ec data balancer block blk_id The index error ,Data cannot be moved

dn->10.12.15.149 use disk 100%

 
{code:java}
echo 10.12.15.149>sorucehost
balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f sorucehost   
2>>~/balancer.log &  {code}
 

The datanode logs contain a lot of the following output:
{code:java}
datanode logs
...
2022-10-19 14:43:02,031 ERROR datanode.DataNode (DataXceiver.java:run(321)) - 
fs-hiido-dn-12-15-149.xx.com:1019:DataXceiver error processing COPY_BLOCK 
operation  src: /10.12.65.216:58214 dst: /10.12.15.149:1019
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not 
found for 
BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:492)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:256)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:1089)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:291)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:113)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
        at java.lang.Thread.run(Thread.java:748)
...    
    
hdfs fsck -fs hdfs://xxcluster06 -blockId blk_-9223372036799576592 
Connecting to namenode via 
http://fs-hiido-xxcluster06-yynn2.xx.com:50070/fsck?ugi=hdfs=blk_-9223372036799576592+=%2F
FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 at Wed Oct 19 
14:47:15 CST 2022
Block Id: blk_-9223372036799576592
Block belongs to: 
/hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
No. of Expected Replica: 5
No. of live Replica: 5
No. of excess Replica: 0
No. of stale Replica: 5
No. of decommissioned Replica: 0
No. of decommissioning Replica: 0
No. of corrupted Replica: 0
Block replica on datanode/rack: fs-hiido-dn-12-66-4.xx.com/4F08-01-09 is HEALTHY
Block replica on datanode/rack: fs-hiido-dn-12-65-244.xx.com/4F08-01-08 is 
HEALTHY
Block replica on datanode/rack: fs-hiido-dn-12-15-149.xx.com/4F08-05-13 is 
HEALTHY
Block replica on datanode/rack: fs-hiido-dn-12-65-218.xx.com/4F08-12-04 is 
HEALTHY
Block replica on datanode/rack: fs-hiido-dn-12-17-35.xx.com/4F08-03-03 is 
HEALTHY



hdfs fsck -fs hdfs://xxcluster06 
/hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
 -files -blocks -locations
Connecting to namenode via 
http://xx.com:50070/fsck?ugi=hdfs=1=1=1=%2Fhive_warehouse%2Fwarehouse_old_snapshots%2Fyy_mbsdkevent_original%2Fdt%3D20210505%2Fpost_202105052129_33.log.gz
FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 for path 
/hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
 at Wed Oct 19 14:48:42 CST 2022
/hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
 500582412 bytes, erasure-coded: policy=RS-3-2-1024k, 1 block(s):  OK
0. BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617 
len=500582412 Live_repl=5  
[blk_-9223372036799576592:DatanodeInfoWithStorage[10.12.17.35:1019,DS-3ccebf8d-5f05-45b5-ac7f-96d1cfb48608,DISK],
 
blk_-9223372036799576591:DatanodeInfoWithStorage[10.12.65.218:1019,DS-4f8e3114-7566-4cf1-ad5a-e454c8ea8805,DISK],
 
blk_-9223372036799576590:DatanodeInfoWithStorage[10.12.15.149:1019,DS-1dd55c27-8f47-46a6-935b-1d9024ca9188,DISK],
 
blk_-9223372036799576589:DatanodeInfoWithStorage[10.12.65.244:1019,DS-a9ffd747-c427-4aaa-8559-04cded7d9d5f,DISK],
 
blk_-9223372036799576588:DatanodeInfoWithStorage[10.12.66.4:1019,DS-d88f94db-6db1-4753-a652-780d7cd7f081,DISK]]
Status: HEALTHY
 Number of data-nodes:  62
 Number of racks:               19
 Total dirs:                    0
 Total symlinks:                0
Replicated Blocks:
 Total size:    0 B
 Total files:   0
 Total blocks (validated):      0
 Minimally replicated blocks:   0
 Over-replicated blocks:        0
 Under-replicated blocks:       0
 Mis-replicated blocks:         0
 Default replication factor:    3
 Average block replication:     0.0
 Missing blocks:                0
 Corrupt blocks:                0
 Missing replicas:              0
Erasure Coded Block Groups:
 Total size:    500582412 B
 Total files:   1
 Total block groups (validated):        1 (avg. block group size 500582412 B)
 Minimally erasure-coded block groups:  1 (100.0 %)
 Over-erasure-coded block groups:       0 (0.0 %)
 Under-erasure-coded block groups:      0 (0.0 %)
 Unsatisfactory placement block groups: 0 (0.0 %)
 Average block group size:      5.0
 Missing block groups:          0
 Corrupt block groups:          0
 Missing internal 
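{code}

A note on why COPY_BLOCK fails, pieced together from the fsck listing above: for an erasure-coded group, each internal block ID is the group ID plus the block index, so group blk_-9223372036799576592 maps to internal blocks ...592 through ...588 across the five DataNodes. The balancer asked 10.12.15.149 for the group ID itself, but that node stores only the index-2 internal block blk_-9223372036799576590, which matches the ReplicaNotFoundException; this appears to be the index handling that HDFS-16333 addresses, per the comments above. A small sketch of the arithmetic:

{code:bash}
# Internal EC block IDs = group ID + index (RS-3-2 has 5 internal blocks)
for i in 0 1 2 3 4; do
  echo "index $i -> blk_$(( -9223372036799576592 + i ))"
done
# index 0 -> blk_-9223372036799576592   (on 10.12.17.35)
# index 1 -> blk_-9223372036799576591   (on 10.12.65.218)
# index 2 -> blk_-9223372036799576590   (on 10.12.15.149)
{code}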

[jira] [Created] (HDFS-16806) ec data balancer block blk_id The index error ,Data cannot be moved

2022-10-19 Thread ruiliang (Jira)
ruiliang created HDFS-16806:
---

 Summary: ec data balancer block blk_id The index error ,Data 
cannot be moved
 Key: HDFS-16806
 URL: https://issues.apache.org/jira/browse/HDFS-16806
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 3.1.0
Reporter: ruiliang


ec data balancer block blk_id The index error ,Data cannot be moved

dn->10.12.15.149 use disk 100%
{code:java}
echo 10.12.15.149>sorucehost
balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f sorucehost   
2>>~/balancer.log & 
 {code}
{code:java}
datanode logs
...
2022-10-19 14:43:02,031 ERROR datanode.DataNode (DataXceiver.java:run(321)) - 
fs-hiido-dn-12-15-149.xx.com:1019:DataXceiver error processing COPY_BLOCK 
operation  src: /10.12.65.216:58214 dst: /10.12.15.149:1019
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not 
found for 
BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:492)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:256)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:1089)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:291)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:113)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
        at java.lang.Thread.run(Thread.java:748)
...        
hdfs fsck -fs hdfs://xxcluster06 -blockId blk_-9223372036799576592 
Connecting to namenode via 
http://fs-hiido-xxcluster06-yynn2.xx.com:50070/fsck?ugi=hdfs=blk_-9223372036799576592+=%2F
FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 at Wed Oct 19 
14:47:15 CST 2022
Block Id: blk_-9223372036799576592
Block belongs to: 
/hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
No. of Expected Replica: 5
No. of live Replica: 5
No. of excess Replica: 0
No. of stale Replica: 5
No. of decommissioned Replica: 0
No. of decommissioning Replica: 0
No. of corrupted Replica: 0
Block replica on datanode/rack: fs-hiido-dn-12-66-4.xx.com/4F08-01-09 is HEALTHY
Block replica on datanode/rack: fs-hiido-dn-12-65-244.xx.com/4F08-01-08 is 
HEALTHY
Block replica on datanode/rack: fs-hiido-dn-12-15-149.xx.com/4F08-05-13 is 
HEALTHY
Block replica on datanode/rack: fs-hiido-dn-12-65-218.xx.com/4F08-12-04 is 
HEALTHY
Block replica on datanode/rack: fs-hiido-dn-12-17-35.xx.com/4F08-03-03 is 
HEALTHY

hdfs fsck -fs hdfs://xxcluster06 
/hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
 -files -blocks -locations
Connecting to namenode via 
http://xx.com:50070/fsck?ugi=hdfs=1=1=1=%2Fhive_warehouse%2Fwarehouse_old_snapshots%2Fyy_mbsdkevent_original%2Fdt%3D20210505%2Fpost_202105052129_33.log.gz
FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 for path 
/hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
 at Wed Oct 19 14:48:42 CST 2022
/hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz
 500582412 bytes, erasure-coded: policy=RS-3-2-1024k, 1 block(s):  OK
0. BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617 
len=500582412 Live_repl=5  
[blk_-9223372036799576592:DatanodeInfoWithStorage[10.12.17.35:1019,DS-3ccebf8d-5f05-45b5-ac7f-96d1cfb48608,DISK],
 
blk_-9223372036799576591:DatanodeInfoWithStorage[10.12.65.218:1019,DS-4f8e3114-7566-4cf1-ad5a-e454c8ea8805,DISK],
 
blk_-9223372036799576590:DatanodeInfoWithStorage[10.12.15.149:1019,DS-1dd55c27-8f47-46a6-935b-1d9024ca9188,DISK],
 
blk_-9223372036799576589:DatanodeInfoWithStorage[10.12.65.244:1019,DS-a9ffd747-c427-4aaa-8559-04cded7d9d5f,DISK],
 
blk_-9223372036799576588:DatanodeInfoWithStorage[10.12.66.4:1019,DS-d88f94db-6db1-4753-a652-780d7cd7f081,DISK]]
Status: HEALTHY
 Number of data-nodes:  62
 Number of racks:               19
 Total dirs:                    0
 Total symlinks:                0
Replicated Blocks:
 Total size:    0 B
 Total files:   0
 Total blocks (validated):      0
 Minimally replicated blocks:   0
 Over-replicated blocks:        0
 Under-replicated blocks:       0
 Mis-replicated blocks:         0
 Default replication factor:    3
 Average block replication:     0.0
 Missing blocks:                0
 Corrupt blocks:                0
 Missing replicas:              0
Erasure Coded Block Groups:
 Total size:    500582412 B
 Total files:   1
 Total block groups (validated):        1 (avg. block group size 500582412 B)
 Minimally erasure-coded block groups:  1 (100.0 %)
 Over-erasure-coded block groups:       0 (0.0 %)
 Under-erasure-coded block groups:      0 (0.0 %)
 Unsatisfactory placement 

[jira] [Comment Edited] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space

2022-10-12 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614703#comment-17614703
 ] 

ruiliang edited comment on HDFS-16799 at 10/13/22 2:19 AM:
---

It seems the newly added empty nodes are concentrated on a few racks, so the 
placement policy cannot select enough distinct racks. In this case, is the only 
option to adjust each rack to a reasonable number of nodes?
{code:java}
  Datanode 10.12.65.241:1019 is not chosen since the rack has too many chosen 
nodes.
2022-10-09 19:27:18,407 DEBUG blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseLocalRack(637)) - Failed to choose from 
local rack (location = /4F08-05-15), retry with the rack of the next replica 
(location = /4F08-12-03)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
 
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:834)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:629)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:589)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseOnce(BlockPlacementPolicyRackFaultTolerant.java:218)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseTargetInOrder(BlockPlacementPolicyRackFaultTolerant.java:94)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:419)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:295)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:148)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.ErasureCodingWork.chooseTargets(ErasureCodingWork.java:60)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1862)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1814)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4655)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4522)
        at java.lang.Thread.run(Thread.java:748)
2022-10-09 19:27:18,416 DEBUG blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(824)) - [
Node /4F08-01-08/10.12.65.242:1019 [
  Datanode 10.12.65.242:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.248:1019 [
  Datanode 10.12.65.248:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.195:1019 [
  Datanode 10.12.65.195:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.241:1019 [
  Datanode 10.12.65.241:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.243:1019 [
  Datanode 10.12.65.243:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.244:1019 [
  Datanode 10.12.65.244:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.249:1019 [
  Datanode 10.12.65.249:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.245:1019 [
  Datanode 10.12.65.245:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.240:1019 [
  Datanode 10.12.65.240:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.247:1019 [
  Datanode 10.12.65.247:1019 is not chosen since the rack has too many chosen 
nodes.
2022-10-09 19:27:18,416 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(832)) - Not enough replicas was 
chosen. Reason:{TOO_MANY_NODES_ON_RACK=10}
2022-10-09 19:27:18,417 DEBUG blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseFromNextRack(669)) - Failed to choose 
from the next rack (location = /4F08-01-08), retry choosing randomly
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
 
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:834)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:722)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:665)
        at 
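{code}

One way to confirm that the empty nodes really are concentrated on a few racks is to dump the topology as the NameNode sees it. A sketch using standard commands, assuming DataNodes listen on port 1019 as in the logs above:

{code:bash}
# Print every rack and the DataNodes registered under it
hdfs dfsadmin -printTopology

# Count DataNodes per rack to spot over- and under-populated racks
hdfs dfsadmin -printTopology \
  | awk '/^Rack:/ {rack=$2} /:1019/ {count[rack]++} END {for (r in count) print count[r], r}' \
  | sort -rn
{code}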

[jira] [Commented] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space

2022-10-09 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614703#comment-17614703
 ] 

ruiliang commented on HDFS-16799:
-

It seems the newly added empty nodes are concentrated on a few racks, so the 
placement policy cannot select enough distinct racks. In this case, is the only 
option to break the nodes up across more racks?
{code:java}
  Datanode 10.12.65.241:1019 is not chosen since the rack has too many chosen 
nodes.
2022-10-09 19:27:18,407 DEBUG blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseLocalRack(637)) - Failed to choose from 
local rack (location = /4F08-05-15), retry with the rack of the next replica 
(location = /4F08-12-03)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
 
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:834)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:629)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:589)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseOnce(BlockPlacementPolicyRackFaultTolerant.java:218)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseTargetInOrder(BlockPlacementPolicyRackFaultTolerant.java:94)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:419)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:295)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:148)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.ErasureCodingWork.chooseTargets(ErasureCodingWork.java:60)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1862)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1814)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4655)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4522)
        at java.lang.Thread.run(Thread.java:748)
2022-10-09 19:27:18,416 DEBUG blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(824)) - [
Node /4F08-01-08/10.12.65.242:1019 [
  Datanode 10.12.65.242:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.248:1019 [
  Datanode 10.12.65.248:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.195:1019 [
  Datanode 10.12.65.195:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.241:1019 [
  Datanode 10.12.65.241:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.243:1019 [
  Datanode 10.12.65.243:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.244:1019 [
  Datanode 10.12.65.244:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.249:1019 [
  Datanode 10.12.65.249:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.245:1019 [
  Datanode 10.12.65.245:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.240:1019 [
  Datanode 10.12.65.240:1019 is not chosen since the rack has too many chosen 
nodes.
Node /4F08-01-08/10.12.65.247:1019 [
  Datanode 10.12.65.247:1019 is not chosen since the rack has too many chosen 
nodes.
2022-10-09 19:27:18,416 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(832)) - Not enough replicas was 
chosen. Reason:{TOO_MANY_NODES_ON_RACK=10}
2022-10-09 19:27:18,417 DEBUG blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseFromNextRack(669)) - Failed to choose 
from the next rack (location = /4F08-01-08), retry choosing randomly
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
 
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:834)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:722)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:665)
        at 

[jira] [Updated] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space

2022-10-09 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-16799:

Description: 
 
{code:java}
echo 'A DFS Used 99.8% to ip' > sorucehost  
hdfs --debug  balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f 
sorucehost  

22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.243:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.247:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-15-10/10.12.65.214:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-02-08/10.12.14.8:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-13/10.12.15.154:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-04/10.12.65.218:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.143:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-05/10.12.12.200:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.217:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.142:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.246:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.219:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.147:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-15-10/10.12.65.186:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-13/10.12.15.153:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-07/10.12.19.23:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-04-14/10.12.65.119:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.131:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-04/10.12.12.210:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-11/10.12.14.168:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.245:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-02/10.12.17.26:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.241:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-13/10.12.15.152:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.249:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-07-14/10.12.64.71:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-03/10.12.17.35:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.195:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.242:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.248:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.240:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-15-12/10.12.65.196:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-13/10.12.15.150:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.222:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.145:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.244:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-07/10.12.19.22:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.221:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.136:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.129:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-15/10.12.15.163:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-07-14/10.12.64.72:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-13/10.12.15.149:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.130:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.220:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-01/10.12.17.27:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-15/10.12.15.162:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.216:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-07/10.12.19.20:1019
22/10/09 16:43:52 INFO net.NetworkTopology: 

[jira] [Created] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space

2022-10-09 Thread ruiliang (Jira)
ruiliang created HDFS-16799:
---

 Summary: The dn space size is not consistent, and Balancer can not 
work, resulting in a very unbalanced space
 Key: HDFS-16799
 URL: https://issues.apache.org/jira/browse/HDFS-16799
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 3.1.0
Reporter: ruiliang


 
{code:java}
echo 'A DFS Used 99.8% to ip' > sorucehost  
hdfs --debug  balancer  -fs hdfs://xxcluster06  -threshold 10 -source -f 
sorucehost  

22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.243:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.247:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-15-10/10.12.65.214:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-02-08/10.12.14.8:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-13/10.12.15.154:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-04/10.12.65.218:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.143:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-05/10.12.12.200:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.217:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.142:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.246:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.219:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.147:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-15-10/10.12.65.186:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-13/10.12.15.153:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-07/10.12.19.23:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-04-14/10.12.65.119:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.131:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-04/10.12.12.210:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-11/10.12.14.168:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.245:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-02/10.12.17.26:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.241:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-13/10.12.15.152:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.249:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-07-14/10.12.64.71:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-03/10.12.17.35:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.195:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.242:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.248:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.240:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-15-12/10.12.65.196:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-13/10.12.15.150:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.222:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.145:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-01-08/10.12.65.244:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-07/10.12.19.22:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.221:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.136:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.129:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-15/10.12.15.163:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-07-14/10.12.64.72:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-13/10.12.15.149:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.130:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-12-03/10.12.65.220:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-03-01/10.12.17.27:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
/4F08-05-15/10.12.15.162:1019
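{code}

A small sketch of how the over-full DataNode for the source-hosts file can be identified from the command line rather than the NameNode UI (a sketch; the output labels follow the standard dfsadmin report format):
{code:java}
# List live DataNodes with their usage; the node near 99.8% DFS Used
# is the one to put into the sourcehost file for balancer -source
hdfs dfsadmin -report -live | grep -E '^Name:|DFS Used%'
{code}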

[jira] [Updated] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation

2022-09-30 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-16788:

Description: 
 

!image-2022-09-30-14-14-44-164.png!
||Configured Capacity:|3.02 PB|
||Configured Remote Capacity:|0 B|
||DFS Used:|1.39 PB (45.96%)|
||Non DFS Used:|0 B|
||DFS Remaining:|1.62 PB (53.67%)|
||Block Pool Used:|1.39 PB (45.96%)|
||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
||[Live 
Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50
 (Decommissioned: 0, In Maintenance: 0)
 |

I have been running the Balancer in the background to rebalance the data, but 
while it was still running, my distcp job failed. The Balancer command, then 
the distcp command with its task syslog:
{code:java}
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 
-Ddfs.balancer.moverThreads=1200 
-Ddfs.datanode.balance.bandwidthPerSec=1073741824 -fs hdfs://yycluster06 
-threshold 50
{code}
{code:java}
hadoop distcp -Dmapreduce.task.timeout=60 -skipcrccheck -update hdfs://01   
hdfs://02xx

syslog
 ...
2022-09-30 14:22:50,724 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: 
Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough 
datanodes? Exclude nodes=[] 2022-09-30 14:22:58,389 INFO [main] 
org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer 
#3: failed, blk_-9223372036808890525_3095130 2022-09-30 14:22:58,389 INFO 
[main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed 
streamer #4: failed, block==null 2022-09-30 14:23:21,547 WARN [main] 
org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, 
policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 2022-09-30 
14:23:29,319 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing 
previously failed streamer #4: failed, blk_-9223372036808889612_3095200 
2022-09-30 14:23:36,950 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: 
Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough 
datanodes? Exclude nodes=[] 2022-09-30 14:23:44,822 INFO [main] 
org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer 
#4: failed, blk_-922337203680572_3095307 2022-09-30 14:23:44,837 WARN 
[main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity 
block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 
2022-09-30 14:23:52,306 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: 
replacing previously failed streamer #4: failed, block==null 2022-09-30 
14:23:52,321 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot 
allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? 
Exclude nodes=[] 2022-09-30 14:23:59,822 INFO [main] 
org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer 
#4: failed, block==null 2022-09-30 14:23:59,836 WARN [main] 
org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=3, 
policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 2022-09-30 
14:23:59,836 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot 
allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? 
Exclude nodes=[] 2022-09-30 14:24:07,302 INFO [main] 
org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer 
#3: failed, blk_-9223372036808887853_3095387 2022-09-30 14:24:07,303 INFO 
[main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed 
streamer #4: failed, block==null 2022-09-30 14:24:07,317 WARN [main] 
org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, 
policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 2022-09-30 
14:24:15,383 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing 
previously failed streamer #4: failed, block==null 2022-09-30 14:24:15,395 WARN 
[main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity 
block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 
2022-09-30 14:24:22,795 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: 
replacing previously failed streamer #4: failed, block==null 2022-09-30 
14:24:22,812 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot 
allocate parity block(index=3, policy=RS-3-2-1024k). Not enough datanodes? 
Exclude nodes=[] 2022-09-30 14:24:22,812 WARN [main] 
org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, 
policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 2022-09-30 
14:24:31,352 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing 
previously failed streamer #3: failed, blk_-9223372036808887133_3095476 
2022-09-30 14:24:31,352 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: 
replacing previously failed streamer #4: failed, block==null

distcp output:
...
Error: java.io.IOException: File copy failed:

 Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 

[jira] [Updated] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation

2022-09-30 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-16788:

Description: 
 

!image-2022-09-30-14-14-44-164.png!
||Configured Capacity:|3.02 PB|
||Configured Remote Capacity:|0 B|
||DFS Used:|1.39 PB (45.96%)|
||Non DFS Used:|0 B|
||DFS Remaining:|1.62 PB (53.67%)|
||Block Pool Used:|1.39 PB (45.96%)|
||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
||[Live 
Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50
 (Decommissioned: 0, In Maintenance: 0)
 |

I have been running the Balancer in the background to rebalance the data, but 
while it was still running, my distcp job failed. The Balancer command, then 
the distcp command with its task syslog:
{code:java}
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 
-Ddfs.balancer.moverThreads=1200 
-Ddfs.datanode.balance.bandwidthPerSec=1073741824 -fs hdfs://yycluster06 
-threshold 50
{code}
{code:java}
hadoop distcp -Dmapreduce.task.timeout=60 -skipcrccheck -update hdfs://01   
hdfs://02xx

...
2022-09-30 14:22:50,724 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: 
Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough 
datanodes? Exclude nodes=[] 2022-09-30 14:22:58,389 INFO [main] 
org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer 
#3: failed, blk_-9223372036808890525_3095130 2022-09-30 14:22:58,389 INFO 
[main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed 
streamer #4: failed, block==null 2022-09-30 14:23:21,547 WARN [main] 
org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, 
policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 2022-09-30 
14:23:29,319 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing 
previously failed streamer #4: failed, blk_-9223372036808889612_3095200 
2022-09-30 14:23:36,950 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: 
Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough 
datanodes? Exclude nodes=[] 2022-09-30 14:23:44,822 INFO [main] 
org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer 
#4: failed, blk_-922337203680572_3095307 2022-09-30 14:23:44,837 WARN 
[main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity 
block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 
2022-09-30 14:23:52,306 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: 
replacing previously failed streamer #4: failed, block==null 2022-09-30 
14:23:52,321 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot 
allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? 
Exclude nodes=[] 2022-09-30 14:23:59,822 INFO [main] 
org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer 
#4: failed, block==null 2022-09-30 14:23:59,836 WARN [main] 
org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=3, 
policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 2022-09-30 
14:23:59,836 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot 
allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? 
Exclude nodes=[] 2022-09-30 14:24:07,302 INFO [main] 
org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer 
#3: failed, blk_-9223372036808887853_3095387 2022-09-30 14:24:07,303 INFO 
[main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed 
streamer #4: failed, block==null 2022-09-30 14:24:07,317 WARN [main] 
org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, 
policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 2022-09-30 
14:24:15,383 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing 
previously failed streamer #4: failed, block==null 2022-09-30 14:24:15,395 WARN 
[main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity 
block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 
2022-09-30 14:24:22,795 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: 
replacing previously failed streamer #4: failed, block==null 2022-09-30 
14:24:22,812 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot 
allocate parity block(index=3, policy=RS-3-2-1024k). Not enough datanodes? 
Exclude nodes=[] 2022-09-30 14:24:22,812 WARN [main] 
org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, 
policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[] 2022-09-30 
14:24:31,352 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing 
previously failed streamer #3: failed, blk_-9223372036808887133_3095476 
2022-09-30 14:24:31,352 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: 
replacing previously failed streamer #4: failed, block==null

distcp output:
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hive_warehouse/warehouse_old_snapshots/credit/.distcp.tmp.attempt_166383067_314191_m_08_2
 could only be 

[jira] [Updated] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation

2022-09-30 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-16788:

Description: 
 

!image-2022-09-30-14-14-44-164.png!
||Configured Capacity:|3.02 PB|
||Configured Remote Capacity:|0 B|
||DFS Used:|1.39 PB (45.96%)|
||Non DFS Used:|0 B|
||DFS Remaining:|1.62 PB (53.67%)|
||Block Pool Used:|1.39 PB (45.96%)|
||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
||[Live 
Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50
 (Decommissioned: 0, In Maintenance: 0)
 |

I have been running the Balancer in the background to rebalance the data, but 
while it was still running, my distcp job failed. The Balancer command, then 
the resulting exception:
{code:java}
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 
-Ddfs.balancer.moverThreads=1200 
-Ddfs.datanode.balance.bandwidthPerSec=1073741824 -fs hdfs://yycluster06 
-threshold 50{code}
{code:java}
//
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hive_warehouse/warehouse_old_snapshots/credit/.distcp.tmp.attempt_166383067_314191_m_08_2
 could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 
50 datanode(s) running and no node(s) are excluded in this operation.
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2128)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2706)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:875)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:561)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
        at org.apache.hadoop.ipc.Client.call(Client.java:1443)
        at org.apache.hadoop.ipc.Client.call(Client.java:1353)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:510)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
        at 
org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1078)
        at 
org.apache.hadoop.hdfs.DFSStripedOutputStream.allocateNewBlock(DFSStripedOutputStream.java:479)
        at 
org.apache.hadoop.hdfs.DFSStripedOutputStream.writeChunk(DFSStripedOutputStream.java:525)
        at 
org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217)
        at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:125)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:111)
        at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at 
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:290)
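{code}

For context, the -D overrides in the Balancer invocation quoted above are standard Balancer/DataNode tuning properties; an annotated form follows as a sketch (same values as in the report, not recommendations):
{code:java}
# dfs.datanode.balance.max.concurrent.moves=300   -> concurrent block moves allowed per DataNode
# dfs.balancer.moverThreads=1200                  -> dispatcher threads inside the Balancer
# dfs.datanode.balance.bandwidthPerSec=1073741824 -> 1073741824 bytes = 1 GiB/s of balancing bandwidth per DataNode
# -threshold 50 -> a DataNode counts as balanced once its utilization is within 50% of the cluster average
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 \
  -Ddfs.balancer.moverThreads=1200 \
  -Ddfs.datanode.balance.bandwidthPerSec=1073741824 \
  -fs hdfs://yycluster06 -threshold 50
{code}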
       

[jira] [Updated] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation

2022-09-30 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-16788:

Description: 
 

!image-2022-09-30-14-14-44-164.png!
||Configured Capacity:|3.02 PB|
||Configured Remote Capacity:|0 B|
||DFS Used:|1.39 PB (45.96%)|
||Non DFS Used:|0 B|
||DFS Remaining:|1.62 PB (53.67%)|
||Block Pool Used:|1.39 PB (45.96%)|
||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
||[Live 
Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50
 (Decommissioned: 0, In Maintenance: 0)
 |


I have been running the Balancer in the background to rebalance the data, but 
while it was still running, my distcp job failed. The Balancer command, then 
the resulting exception:
{code:java}
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 
-Ddfs.balancer.moverThreads=1200 
-Ddfs.datanode.balance.bandwidthPerSec=1073741824 -fs hdfs://yycluster06 
-threshold 50{code}
{code:java}
//
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hive_warehouse/warehouse_old_snapshots/credit/.distcp.tmp.attempt_166383067_314191_m_08_2
 could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 
50 datanode(s) running and no node(s) are excluded in this operation.
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2128)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2706)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:875)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:561)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
        at org.apache.hadoop.ipc.Client.call(Client.java:1443)
        at org.apache.hadoop.ipc.Client.call(Client.java:1353)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:510)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
        at 
org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1078)
        at 
org.apache.hadoop.hdfs.DFSStripedOutputStream.allocateNewBlock(DFSStripedOutputStream.java:479)
        at 
org.apache.hadoop.hdfs.DFSStripedOutputStream.writeChunk(DFSStripedOutputStream.java:525)
        at 
org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217)
        at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:125)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:111)
        at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at 
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:290)
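{code}

One likely reading of the failure above: an RS-3-2-1024k block group wants 5 DataNodes (3 data + 2 parity streamers) and needs at least the 3 data targets, and the NameNode skips nodes that cannot hold a full block, so with usage skewed to a 98.85% maximum (per the table above) even a 50-node cluster can come up short on targets. A quick sketch for confirming which erasure-coding policy applies to the destination (path taken from the error; assumes the Hadoop 3.x ec CLI):
{code:java}
# Show the erasure-coding policy in effect on the distcp destination directory
hdfs ec -getPolicy -path /hive_warehouse/warehouse_old_snapshots/credit
{code}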
      

[jira] [Created] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation

2022-09-30 Thread ruiliang (Jira)
ruiliang created HDFS-16788:
---

 Summary: could only be written to 2 of the 3 required nodes for 
RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in 
this operation
 Key: HDFS-16788
 URL: https://issues.apache.org/jira/browse/HDFS-16788
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ruiliang
 Attachments: image-2022-09-30-14-14-29-963.png, 
image-2022-09-30-14-14-44-164.png

!image-2022-09-30-14-14-44-164.png!
||Configured Capacity:|3.02 PB|
||Configured Remote Capacity:|0 B|
||DFS Used:|1.39 PB (45.96%)|
||Non DFS Used:|0 B|
||DFS Remaining:|1.62 PB (53.67%)|
||Block Pool Used:|1.39 PB (45.96%)|
||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
||[Live 
Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50
 (Decommissioned: 0, In Maintenance: 0)
 
|

I have been running the Balancer in the background to rebalance the data, but 
while it was still running, my distcp job failed. The Balancer command, then 
the resulting exception:
{code:java}
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 
-Ddfs.balancer.moverThreads=1200 
-Ddfs.datanode.balance.bandwidthPerSec=1073741824 -fs hdfs://yycluster06 
-threshold 50{code}
{code:java}

//
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/hive_warehouse/warehouse_old_snapshots/credit/.distcp.tmp.attempt_166383067_314191_m_08_2
 could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 
50 datanode(s) running and no node(s) are excluded in this operation.
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2128)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2706)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:875)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:561)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
        at org.apache.hadoop.ipc.Client.call(Client.java:1443)
        at org.apache.hadoop.ipc.Client.call(Client.java:1353)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:510)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
        at 
org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1078)
        at 
org.apache.hadoop.hdfs.DFSStripedOutputStream.allocateNewBlock(DFSStripedOutputStream.java:479)
        at 
org.apache.hadoop.hdfs.DFSStripedOutputStream.writeChunk(DFSStripedOutputStream.java:525)
        at 
org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217)
        at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:125)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:111)
        at 

[jira] [Updated] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation

2022-09-30 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-16788:

Component/s: hdfs

> could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There 
> are 50 datanode(s) running and no node(s) are excluded in this operation
> ---
>
> Key: HDFS-16788
> URL: https://issues.apache.org/jira/browse/HDFS-16788
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Major
> Attachments: image-2022-09-30-14-14-29-963.png, 
> image-2022-09-30-14-14-44-164.png
>
>
> !image-2022-09-30-14-14-44-164.png!
> ||Configured Capacity:|3.02 PB|
> ||Configured Remote Capacity:|0 B|
> ||DFS Used:|1.39 PB (45.96%)|
> ||Non DFS Used:|0 B|
> ||DFS Remaining:|1.62 PB (53.67%)|
> ||Block Pool Used:|1.39 PB (45.96%)|
> ||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
> ||[Live 
> Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50
>  (Decommissioned: 0, In Maintenance: 0)
>  
> |
> I have been running the Balancer in the background to rebalance the data, but 
> while it was still running, my distcp job failed. The Balancer command, then 
> the resulting exception:
> {code:java}
> hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 
> -Ddfs.balancer.moverThreads=1200 
> -Ddfs.datanode.balance.bandwidthPerSec=1073741824 -fs hdfs://yycluster06 
> -threshold 50{code}
> {code:java}
> //
> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /hive_warehouse/warehouse_old_snapshots/credit/.distcp.tmp.attempt_166383067_314191_m_08_2
>  could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There 
> are 50 datanode(s) running and no node(s) are excluded in this operation.
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2128)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2706)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:875)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:561)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1443)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1353)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>         at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:510)
>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
>         at 
> 

[jira] [Updated] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation

2022-09-30 Thread ruiliang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ruiliang updated HDFS-16788:

Affects Version/s: 3.1.0

> could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There 
> are 50 datanode(s) running and no node(s) are excluded in this operation
> ---
>
> Key: HDFS-16788
> URL: https://issues.apache.org/jira/browse/HDFS-16788
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.0
>Reporter: ruiliang
>Priority: Major
> Attachments: image-2022-09-30-14-14-29-963.png, 
> image-2022-09-30-14-14-44-164.png
>
>
> !image-2022-09-30-14-14-44-164.png!
> ||Configured Capacity:|3.02 PB|
> ||Configured Remote Capacity:|0 B|
> ||DFS Used:|1.39 PB (45.96%)|
> ||Non DFS Used:|0 B|
> ||DFS Remaining:|1.62 PB (53.67%)|
> ||Block Pool Used:|1.39 PB (45.96%)|
> ||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
> ||[Live 
> Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50
>  (Decommissioned: 0, In Maintenance: 0)
>  
> |
> I have been running the Balancer in the background to rebalance the data, but 
> while it was still running, my distcp job failed. The Balancer command, then 
> the resulting exception:
> {code:java}
> hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 
> -Ddfs.balancer.moverThreads=1200 
> -Ddfs.datanode.balance.bandwidthPerSec=1073741824 -fs hdfs://yycluster06 
> -threshold 50{code}
> {code:java}
> //
> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /hive_warehouse/warehouse_old_snapshots/credit/.distcp.tmp.attempt_166383067_314191_m_08_2
>  could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There 
> are 50 datanode(s) running and no node(s) are excluded in this operation.
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2128)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2706)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:875)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:561)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1443)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1353)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>         at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:510)
>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
>         at 
>