[jira] [Assigned] (HDFS-17596) [ARR] RouterStoragePolicy supports asynchronous rpc.

2024-07-26 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17596:


Assignee: farmmamba

> [ARR] RouterStoragePolicy supports asynchronous rpc.
> 
>
> Key: HDFS-17596
> URL: https://issues.apache.org/jira/browse/HDFS-17596
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jian Zhang
>Assignee: farmmamba
>Priority: Major
>
> *Description*
> The main new addition is RouterAsyncStoragePolicy, which extends 
> RouterStoragePolicy so that it supports asynchronous RPC.
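
To make the one-line description above concrete, here is a minimal sketch of 
the extend-and-override pattern it names (all class and method names below are 
illustrative assumptions, not the actual ARR code): the async subclass 
overrides a blocking RPC method and completes the work on an asynchronous 
pipeline instead of blocking the handler thread.

{code:java}
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins for RouterStoragePolicy and RouterAsyncStoragePolicy.
class SyncStoragePolicy {
  void setStoragePolicy(String src, String policyName) {
    // placeholder for a blocking invocation of the downstream namenode
  }
}

class AsyncStoragePolicy extends SyncStoragePolicy {
  @Override
  void setStoragePolicy(String src, String policyName) {
    // Hand the blocking call to an async executor so the RPC handler
    // thread is released immediately.
    CompletableFuture.runAsync(() -> super.setStoragePolicy(src, policyName));
  }
}
{code}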



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Assigned] (HDFS-17595) [ARR] ErasureCoding supports asynchronous rpc.

2024-07-26 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17595:


Assignee: farmmamba

> [ARR] ErasureCoding supports asynchronous rpc.
> --
>
> Key: HDFS-17595
> URL: https://issues.apache.org/jira/browse/HDFS-17595
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jian Zhang
>Assignee: farmmamba
>Priority: Major
>
> *Description*
> The main new addition is AsyncErasureCoding, which extends ErasureCoding so 
> that it supports asynchronous RPC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Assigned] (HDFS-17597) [ARR] RouterSnapshot supports asynchronous rpc.

2024-07-26 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17597:


Assignee: farmmamba

> [ARR] RouterSnapshot supports asynchronous rpc.
> ---
>
> Key: HDFS-17597
> URL: https://issues.apache.org/jira/browse/HDFS-17597
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jian Zhang
>Assignee: farmmamba
>Priority: Major
>
> *Description*
> The main new addition is RouterAsyncSnapshot, which extends RouterSnapshot so 
> that it supports asynchronous RPC.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Comment Edited] (HDFS-17589) hdfs EC data new blk reconstruct old blk not delete

2024-07-22 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867935#comment-17867935
 ] 

farmmamba edited comment on HDFS-17589 at 7/23/24 3:07 AM:
---

[~ruiliang] Sir, this problem has not been resolved in the current trunk. We 
also met this problem in our production environment.

We fixed this problem by modifying the directory scanner logic.


was (Author: zhanghaobo):
[~ruiliang] Sir, this problem has not been resolved in the current trunk. We 
also met this problem in our production environment.

We fix this problem by modifying the directory scanner logic.

> hdfs EC data  new blk reconstruct   old blk not delete
> --
>
> Key: HDFS-17589
> URL: https://issues.apache.org/jira/browse/HDFS-17589
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: ruiliang
>Priority: Major
>
> The reason is that the cluster was faulty before, and DataNodes kept losing 
> connections and recovering, resulting in a lot of EC data reconstruction, but 
> many old blocks failed to be cleaned up correctly. Has this been fixed? Which 
> patch do I need to apply? Thank you.
> The following is a detailed check log:
>  
> ok:     blk_-9223372036371044652  in 10.12.66.225  
> {color:#de350b}error:  blk_-9223372036371044652 in  
> 10.12.66.154(/data3/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044652)
>  {color}
> {color:#de350b}Why didn't you delete it?{color}
>  
> {code:java}
> datanode delete data ec blk ?
>  grep blk_-9223372036371044656  
> hadoop-hdfs-root-datanode-fs-hiido-dn-12-66-111.hiido.host.xxx.com.log
> 2024-07-18 17:25:07,879 INFO  datanode.DataNode 
> (DataXceiver.java:writeBlock(738)) - Receiving 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 
> src: /10.12.66.111:25066 dest: /10.12.66.111:1019
> 2024-07-18 17:25:17,396 INFO  datanode.DataNode 
> (StripedBlockReconstructor.java:run(86)) - ok EC reconstruct striped block: 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793  
> blockId: -9223372036371044656
> 2024-07-18 17:25:17,396 INFO  datanode.DataNode 
> (DataXceiver.java:writeBlock(914)) - Received 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 
> src: /10.12.66.111:25066 dest: /10.12.66.111:1019 of size 193986560
> 2024-07-18 17:25:25,465 INFO  impl.FsDatasetAsyncDiskService 
> (FsDatasetAsyncDiskService.java:deleteAsync(225)) - Scheduling 
> blk_-9223372036371044656_1688858793 replica FinalizedReplica, 
> blk_-9223372036371044656_1688858793, FINALIZED
>   getBlockURI()     = 
> file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656
>  for deletion
> 2024-07-18 17:25:25,746 INFO  impl.FsDatasetAsyncDiskService 
> (FsDatasetAsyncDiskService.java:run(333)) - Deleted 
> BP-1822992414-10.12.65.48-1660893388633 blk_-9223372036371044656_1688858793 
> URI 
> file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656
> my config:
> dfs.blockreport.intervalMsec = 2160
> namenode3 log:
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  04:34:39,523 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  04:34:40,131 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  10:34:38,950 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  10:34:39,559 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18
>  16:34:38,564 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18
>  16:34:39,190 WARN  

[jira] [Commented] (HDFS-17589) hdfs EC data new blk reconstruct old blk not delete

2024-07-22 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867935#comment-17867935
 ] 

farmmamba commented on HDFS-17589:
--

[~ruiliang] Sir, this problem has not been resolved in the current trunk. We 
also met this problem in our production environment.

We fix this problem by using the directory scanner.

> hdfs EC data  new blk reconstruct   old blk not delete
> --
>
> Key: HDFS-17589
> URL: https://issues.apache.org/jira/browse/HDFS-17589
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: ruiliang
>Priority: Major
>
> The reason is that the cluster was faulty before, and DataNodes kept losing 
> connections and recovering, resulting in a lot of EC data reconstruction, but 
> many old blocks failed to be cleaned up correctly. Has this been fixed? Which 
> patch do I need to apply? Thank you.
> The following is a detailed check log:
>  
> ok:     blk_-9223372036371044652  in 10.12.66.225  
> {color:#de350b}error:  blk_-9223372036371044652 in  
> 10.12.66.154(/data3/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044652)
>  {color}
> {color:#de350b}Why didn't you delete it?{color}
>  
> {code:java}
> datanode delete data ec blk ?
>  grep blk_-9223372036371044656  
> hadoop-hdfs-root-datanode-fs-hiido-dn-12-66-111.hiido.host.xxx.com.log
> 2024-07-18 17:25:07,879 INFO  datanode.DataNode 
> (DataXceiver.java:writeBlock(738)) - Receiving 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 
> src: /10.12.66.111:25066 dest: /10.12.66.111:1019
> 2024-07-18 17:25:17,396 INFO  datanode.DataNode 
> (StripedBlockReconstructor.java:run(86)) - ok EC reconstruct striped block: 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793  
> blockId: -9223372036371044656
> 2024-07-18 17:25:17,396 INFO  datanode.DataNode 
> (DataXceiver.java:writeBlock(914)) - Received 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 
> src: /10.12.66.111:25066 dest: /10.12.66.111:1019 of size 193986560
> 2024-07-18 17:25:25,465 INFO  impl.FsDatasetAsyncDiskService 
> (FsDatasetAsyncDiskService.java:deleteAsync(225)) - Scheduling 
> blk_-9223372036371044656_1688858793 replica FinalizedReplica, 
> blk_-9223372036371044656_1688858793, FINALIZED
>   getBlockURI()     = 
> file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656
>  for deletion
> 2024-07-18 17:25:25,746 INFO  impl.FsDatasetAsyncDiskService 
> (FsDatasetAsyncDiskService.java:run(333)) - Deleted 
> BP-1822992414-10.12.65.48-1660893388633 blk_-9223372036371044656_1688858793 
> URI 
> file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656
> my config:
> dfs.blockreport.intervalMsec = 2160
> namenode3 log:
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  04:34:39,523 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  04:34:40,131 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  10:34:38,950 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  10:34:39,559 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18
>  16:34:38,564 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18
>  16:34:39,190 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17
>  

[jira] [Comment Edited] (HDFS-17589) hdfs EC data new blk reconstruct old blk not delete

2024-07-22 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867935#comment-17867935
 ] 

farmmamba edited comment on HDFS-17589 at 7/23/24 3:07 AM:
---

[~ruiliang] Sir, this problem has not been resolved in the current trunk. We 
also met this problem in our production environment.

We fix this problem by modifying the directory scanner logic.


was (Author: zhanghaobo):
[~ruiliang] Sir, this problem has not been resolved in the current trunk. We 
also met this problem in our production environment.

We fix this problem by using the directory scanner.

> hdfs EC data  new blk reconstruct   old blk not delete
> --
>
> Key: HDFS-17589
> URL: https://issues.apache.org/jira/browse/HDFS-17589
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: ruiliang
>Priority: Major
>
> The reason is that the cluster was faulty before, and DataNodes kept losing 
> connections and recovering, resulting in a lot of EC data reconstruction, but 
> many old blocks failed to be cleaned up correctly. Has this been fixed? Which 
> patch do I need to apply? Thank you.
> The following is a detailed check log:
>  
> ok:     blk_-9223372036371044652  in 10.12.66.225  
> {color:#de350b}error:  blk_-9223372036371044652 in  
> 10.12.66.154(/data3/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044652)
>  {color}
> {color:#de350b}Why didn't you delete it?{color}
>  
> {code:java}
> datanode delete data ec blk ?
>  grep blk_-9223372036371044656  
> hadoop-hdfs-root-datanode-fs-hiido-dn-12-66-111.hiido.host.xxx.com.log
> 2024-07-18 17:25:07,879 INFO  datanode.DataNode 
> (DataXceiver.java:writeBlock(738)) - Receiving 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 
> src: /10.12.66.111:25066 dest: /10.12.66.111:1019
> 2024-07-18 17:25:17,396 INFO  datanode.DataNode 
> (StripedBlockReconstructor.java:run(86)) - ok EC reconstruct striped block: 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793  
> blockId: -9223372036371044656
> 2024-07-18 17:25:17,396 INFO  datanode.DataNode 
> (DataXceiver.java:writeBlock(914)) - Received 
> BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 
> src: /10.12.66.111:25066 dest: /10.12.66.111:1019 of size 193986560
> 2024-07-18 17:25:25,465 INFO  impl.FsDatasetAsyncDiskService 
> (FsDatasetAsyncDiskService.java:deleteAsync(225)) - Scheduling 
> blk_-9223372036371044656_1688858793 replica FinalizedReplica, 
> blk_-9223372036371044656_1688858793, FINALIZED
>   getBlockURI()     = 
> file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656
>  for deletion
> 2024-07-18 17:25:25,746 INFO  impl.FsDatasetAsyncDiskService 
> (FsDatasetAsyncDiskService.java:run(333)) - Deleted 
> BP-1822992414-10.12.65.48-1660893388633 blk_-9223372036371044656_1688858793 
> URI 
> file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656
> my config:
> dfs.blockreport.intervalMsec = 2160
> namenode3 log:
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  04:34:39,523 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  04:34:40,131 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  10:34:38,950 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
>  10:34:39,559 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18
>  16:34:38,564 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) 
> - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
> storageType DISK on node 10.12.66.154:1019
> hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18
>  16:34:39,190 WARN  BlockStateChange 

[jira] [Resolved] (HDFS-17580) Change the default value of dfs.datanode.lock.fair to false due to potential hang

2024-07-15 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba resolved HDFS-17580.
--
Resolution: Won't Fix

Changing the default to false alone cannot prevent the DataNode hang.

We need some other methods.
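
For context, a minimal sketch of what the fairness flag toggles, assuming the 
dataset lock is backed by a ReentrantReadWriteLock (my illustration, not code 
from this JIRA):

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockFairnessDemo {
  public static void main(String[] args) {
    // A fair lock grants access in roughly FIFO order, which avoids
    // starvation but lowers throughput; a non-fair lock allows barging.
    ReentrantReadWriteLock fairLock = new ReentrantReadWriteLock(true);
    ReentrantReadWriteLock nonFairLock = new ReentrantReadWriteLock(false);
    System.out.println(fairLock.isFair() + " " + nonFairLock.isFair());
  }
}
{code}

Per the resolution above, flipping this default alone was not enough to 
prevent the hang.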

> Change the default value of dfs.datanode.lock.fair to false due to potential 
> hang
> -
>
> Key: HDFS-17580
> URL: https://issues.apache.org/jira/browse/HDFS-17580
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (HDFS-17580) Change the default value of dfs.datanode.lock.fair to false due to potential hang

2024-07-15 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17580:
-
Summary: Change the default value of dfs.datanode.lock.fair to false due to 
potential hang  (was: Change the default value of dfs.datanode.lock.fair to 
false)

> Change the default value of dfs.datanode.lock.fair to false due to potential 
> hang
> -
>
> Key: HDFS-17580
> URL: https://issues.apache.org/jira/browse/HDFS-17580
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (HDFS-17580) Change the default value of dfs.datanode.lock.fair to false

2024-07-15 Thread farmmamba (Jira)
farmmamba created HDFS-17580:


 Summary: Change the default value of dfs.datanode.lock.fair to 
false
 Key: HDFS-17580
 URL: https://issues.apache.org/jira/browse/HDFS-17580
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (HDFS-17567) Return value of method RouterRpcClient#invokeSequential is not accurate

2024-07-03 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17567:
-
Description: 
The code below shows the return value of the method RouterRpcClient#invokeSequential.

 
{code:java}
// Return the first result, whether it is the value or not
@SuppressWarnings("unchecked") T ret = (T) firstResult;
return new RemoteResult<>(locations.get(0), ret); {code}
 

 

`locations.get(0)` is not accurate, because it may not be the remote location 
that actually returned `ret`.
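
One possible shape of a fix, as a hedged, self-contained sketch (all names are 
hypothetical stand-ins, not the actual RouterRpcClient code): remember which 
location produced the first result during the sequential iteration, and return 
that location instead of unconditionally returning `locations.get(0)`.

{code:java}
import java.util.Arrays;
import java.util.List;

public class SequentialInvokeDemo {
  static String invokeOn(String location) {
    // Stand-in for the per-namespace RPC call; only "ns1" succeeds here.
    return "ns1".equals(location) ? "ok" : null;
  }

  public static void main(String[] args) {
    List<String> locations = Arrays.asList("ns0", "ns1");
    String firstResult = null;
    String firstLocation = null;
    for (String loc : locations) {
      String result = invokeOn(loc);
      if (result != null && firstResult == null) {
        firstResult = result;
        firstLocation = loc;  // remember where the result came from
      }
    }
    // Prints "ns1 -> ok"; returning locations.get(0) would wrongly report ns0.
    System.out.println((firstLocation != null ? firstLocation : locations.get(0))
        + " -> " + firstResult);
  }
}
{code}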

> Return value of method RouterRpcClient#invokeSequential is not accurate
> ---
>
> Key: HDFS-17567
> URL: https://issues.apache.org/jira/browse/HDFS-17567
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> The code below shows the return value of the method RouterRpcClient#invokeSequential.
>  
> {code:java}
> // Return the first result, whether it is the value or not
> @SuppressWarnings("unchecked") T ret = (T) firstResult;
> return new RemoteResult<>(locations.get(0), ret); {code}
>  
>  
> `locations.get(0)` is not accurate, because it may not be the remote location 
> that actually returned `ret`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (HDFS-17567) Return value of method RouterRpcClient#invokeSequential is not accurate

2024-07-03 Thread farmmamba (Jira)
farmmamba created HDFS-17567:


 Summary: Return value of method RouterRpcClient#invokeSequential 
is not accurate
 Key: HDFS-17567
 URL: https://issues.apache.org/jira/browse/HDFS-17567
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (HDFS-17551) Fix unit test failure caused by HDFS-17464

2024-06-12 Thread farmmamba (Jira)
farmmamba created HDFS-17551:


 Summary: Fix unit test failure caused by HDFS-17464
 Key: HDFS-17551
 URL: https://issues.apache.org/jira/browse/HDFS-17551
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: farmmamba
Assignee: farmmamba


As the title says: this Jira fixes the unit test failure caused by HDFS-17464.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (HDFS-17464) Improve some logs output in class FsDatasetImpl

2024-06-12 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854304#comment-17854304
 ] 

farmmamba commented on HDFS-17464:
--

@Ayush Saxena (Jira) Sir, thanks for reminding me. Will fix it soon.


张浩博
hfutzhan...@163.com


 Replied Message 

[ 
https://issues.apache.org/jira/browse/HDFS-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854246#comment-17854246
 ]

Ayush Saxena commented on HDFS-17464:
-

[~zhanghaobo] this seems to be leading to a test failure
https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1604/testReport/junit/org.apache.hadoop.hdfs.server.datanode.fsdataset.impl/TestFsDatasetImpl/testMoveBlockFailure/

I think it is asserting the error message; can you shoot an Addendum PR to fix 
the test?


Improve some logs output in class FsDatasetImpl
---

Key: HDFS-17464
URL: https://issues.apache.org/jira/browse/HDFS-17464
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 3.5.0
Reporter: farmmamba
Assignee: farmmamba
Priority: Minor
Labels: pull-request-available
Fix For: 3.5.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)



> Improve some logs output in class FsDatasetImpl
> ---
>
> Key: HDFS-17464
> URL: https://issues.apache.org/jira/browse/HDFS-17464
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.5.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (HDFS-17474) [FGL] Make INodeMap thread safe

2024-05-27 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849869#comment-17849869
 ] 

farmmamba commented on HDFS-17474:
--

[~coconut_icecream] Hi, sir. Thanks a lot for sharing and responding. I prefer 
your 3rd option.

I have done this with ReentrantReadWriteLock in my test branch but have not 
benchmarked it.

And I think the 1st option will have even worse performance than the original 
code, because it is totally serial execution, won't it?

I hope to see your benchmark results; thanks again for your work.
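
For reference, this is roughly the shape of the ReentrantReadWriteLock 
approach, as a minimal self-contained sketch (illustrative only; my test 
branch differs): lookups share the read lock while structural updates take the 
exclusive write lock.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal sketch of a read/write-locked INode map (not the actual FGL code).
class LockedINodeMap<K, V> {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<K, V> map = new HashMap<>();

  V get(K key) {
    lock.readLock().lock();          // many readers may hold this at once
    try { return map.get(key); } finally { lock.readLock().unlock(); }
  }

  V put(K key, V value) {
    lock.writeLock().lock();         // writers are exclusive
    try { return map.put(key, value); } finally { lock.writeLock().unlock(); }
  }
}
{code}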

 

> [FGL] Make INodeMap thread safe
> ---
>
> Key: HDFS-17474
> URL: https://issues.apache.org/jira/browse/HDFS-17474
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: Felix N
>Priority: Major
>
> Operations related to the INodeMap should be handled safely by the namenode, 
> since operations may access or update the INodeMap concurrently.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (HDFS-17511) method storagespaceConsumedContiguous should use BlockInfo#getReplication to compute dsDelta

2024-05-06 Thread farmmamba (Jira)
farmmamba created HDFS-17511:


 Summary: method storagespaceConsumedContiguous should use 
BlockInfo#getReplication to compute dsDelta
 Key: HDFS-17511
 URL: https://issues.apache.org/jira/browse/HDFS-17511
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: farmmamba
Assignee: farmmamba


As the title says, we should use BlockInfo#getReplication to compute the 
storage space in the method INodeFile#storagespaceConsumedContiguous.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Assigned] (HDFS-16368) DFSAdmin supports refresh topology info without restarting namenode

2024-04-29 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-16368:


Assignee: farmmamba

>  DFSAdmin supports refresh topology info without restarting namenode
> 
>
> Key: HDFS-16368
> URL: https://issues.apache.org/jira/browse/HDFS-16368
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: dfsadmin, namenode
>Affects Versions: 2.7.7, 3.3.1
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: features, pull-request-available
> Attachments: 0001.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently in HDFS, if we update the rack info for rack-awareness, we may need 
> to rolling-restart namenodes for it to take effect. If the cluster is large, 
> the time cost of rolling-restarting namenodes is very long. So, we developed 
> a method to refresh topology info without rolling-restarting namenodes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (HDFS-17477) IncrementalBlockReport race condition additional edge cases

2024-04-24 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840302#comment-17840302
 ] 

farmmamba commented on HDFS-17477:
--

Sir, I also met this problem. Failed with OOM.


张浩博
hfutzhan...@163.com


 Replied Message 

[ 
https://issues.apache.org/jira/browse/HDFS-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840294#comment-17840294
 ]

Ayush Saxena commented on HDFS-17477:
-

Hi [~dannytbecker] 

Seems like since this got committed 
TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit is failing 

ref:

[https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1564/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestLargeBlockReport/testBlockReportSucceedsWithLargerLengthLimit/]

 

It did fail once in the Jenkins result of this PR as well:

[https://github.com/apache/hadoop/pull/6748#issuecomment-2063042088]

 

But in the successive build, I am not sure if it ran or not. 

 

Tried locally: with this change in, it was failing with OOM; I reverted it and 
it passed.

Can you check once?

IncrementalBlockReport race condition additional edge cases
---

Key: HDFS-17477
URL: https://issues.apache.org/jira/browse/HDFS-17477
Project: Hadoop HDFS
Issue Type: Bug
Components: auto-failover, ha, namenode
Affects Versions: 3.3.5, 3.3.4, 3.3.6
Reporter: Danny Becker
Assignee: Danny Becker
Priority: Major
Labels: pull-request-available

HDFS-17453 fixes a race condition between IncrementalBlockReports (IBR) and the 
Edit Log Tailer which can cause the Standby NameNode (SNN) to incorrectly mark 
blocks as corrupt when it transitions to Active. There are a few edge cases 
that HDFS-17453 does not cover.
For Example:
1. SNN1 loads the edits for b1gs1 and b1gs2.
2. DN1 reports b1gs1 to SNN1, so it gets queued for later processing.
3. DN1 reports b1gs2 to SNN1 so it gets added to the blocks map.
4. SNN1 transitions to Active (ANN1).
5. ANN1 processes the pending DN message queue and marks DN1->b1gs1 as corrupt 
because it was still in the queue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)



> IncrementalBlockReport race condition additional edge cases
> ---
>
> Key: HDFS-17477
> URL: https://issues.apache.org/jira/browse/HDFS-17477
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: auto-failover, ha, namenode
>Affects Versions: 3.3.5, 3.3.4, 3.3.6
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Major
>  Labels: pull-request-available
>
> HDFS-17453 fixes a race condition between IncrementalBlockReports (IBR) and 
> the Edit Log Tailer which can cause the Standby NameNode (SNN) to incorrectly 
> mark blocks as corrupt when it transitions to Active. There are a few edge 
> cases that HDFS-17453 does not cover.
> For Example:
> 1. SNN1 loads the edits for b1gs1 and b1gs2.
> 2. DN1 reports b1gs1 to SNN1, so it gets queued for later processing.
> 3. DN1 reports b1gs2 to SNN1 so it gets added to the blocks map.
> 4. SNN1 transitions to Active (ANN1).
> 5. ANN1 processes the pending DN message queue and marks DN1->b1gs1 as 
> corrupt because it was still in the queue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (HDFS-17496) DataNode supports more fine-grained dataset lock based on blockid

2024-04-23 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17496:
-
Description: 
Recently, we used NVMe SSDs as volumes in datanodes and performed some stress 
tests.

We found that NVMe SSD and HDD disks achieve similar performance when creating 
lots of small files, such as 10KB.

This phenomenon is counterintuitive. After analyzing the metric monitoring, we 
found that the fsdataset lock became the bottleneck in high-concurrency 
scenarios.

 

Currently, we have two lock levels: BLOCK_POOL and VOLUME. We can further 
split the volume lock into a DIR lock.

The DIR lock is defined as below: given a blockid, we can determine which 
subdir this block will be placed in under the finalized dir. We just use 
subdir[0-31]/subdir[0-31] as the name of the DIR lock.

For more details, please refer to the method DatanodeUtil#idToBlockDir:
{code:java}
  public static File idToBlockDir(File root, long blockId) {
    int d1 = (int) ((blockId >> 16) & 0x1F);
    int d2 = (int) ((blockId >> 8) & 0x1F);
    String path = DataStorage.BLOCK_SUBDIR_PREFIX + d1 + SEP +
        DataStorage.BLOCK_SUBDIR_PREFIX + d2;
    return new File(root, path);
  } {code}
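Under this scheme, the DIR-lock name can be derived directly from the block 
id; a minimal sketch (the helper name is illustrative, not the actual patch):
{code:java}
// Illustrative helper (not the actual patch): map a block id to its DIR-lock
// name "subdir<d1>/subdir<d2>", mirroring DatanodeUtil#idToBlockDir above.
public class DirLockName {
  static String dirLockName(long blockId) {
    int d1 = (int) ((blockId >> 16) & 0x1F);  // 32 possible first-level subdirs
    int d2 = (int) ((blockId >> 8) & 0x1F);   // 32 possible second-level subdirs
    return "subdir" + d1 + "/subdir" + d2;    // e.g. 0x12345 -> "subdir1/subdir3"
  }
}
{code}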
The performance comparison is as below.

Experimental setup:

3 DataNodes, each with a single disk.

10 clients concurrently write and then delete the files after writing.

550 threads per client.

!image-2024-04-23-16-17-07-057.png!

 

> DataNode supports more fine-grained dataset lock based on blockid
> -
>
> Key: HDFS-17496
> URL: https://issues.apache.org/jira/browse/HDFS-17496
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
> Attachments: image-2024-04-23-16-17-07-057.png
>
>
> Recently, we used NVMe SSDs as volumes in datanodes and performed some stress 
> tests.
> We found that NVMe SSD and HDD disks achieve similar performance when 
> creating lots of small files, such as 10KB.
> This phenomenon is counterintuitive. After analyzing the metric monitoring, 
> we found that the fsdataset lock became the bottleneck in high-concurrency 
> scenarios.
>  
> Currently, we have two lock levels: BLOCK_POOL and VOLUME. We can further 
> split the volume lock into a DIR lock.
> The DIR lock is defined as below: given a blockid, we can determine which 
> subdir this block will be placed in under the finalized dir. We just use 
> subdir[0-31]/subdir[0-31] as the name of the DIR lock.
> For more details, please refer to the method DatanodeUtil#idToBlockDir:
> {code:java}
>   public static File idToBlockDir(File root, long blockId) {
>     int d1 = (int) ((blockId >> 16) & 0x1F);
>     int d2 = (int) ((blockId >> 8) & 0x1F);
>     String path = DataStorage.BLOCK_SUBDIR_PREFIX + d1 + SEP +
>         DataStorage.BLOCK_SUBDIR_PREFIX + d2;
>     return new File(root, path);
>   } {code}
> The performance comparison is as below.
> Experimental setup:
> 3 DataNodes, each with a single disk.
> 10 clients concurrently write and then delete the files after writing.
> 550 threads per client.
> !image-2024-04-23-16-17-07-057.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (HDFS-17496) DataNode supports more fine-grained dataset lock based on blockid

2024-04-23 Thread farmmamba (Jira)
farmmamba created HDFS-17496:


 Summary: DataNode supports more fine-grained dataset lock based on 
blockid
 Key: HDFS-17496
 URL: https://issues.apache.org/jira/browse/HDFS-17496
 Project: Hadoop HDFS
  Issue Type: Task
  Components: datanode
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (HDFS-17488) DN can fail IBRs with NPE when a volume is removed

2024-04-22 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839558#comment-17839558
 ] 

farmmamba commented on HDFS-17488:
--

Hi, [~coconut_icecream]. Sir, of course. I have looked at your code roughly 
and will review it more carefully later.

> DN can fail IBRs with NPE when a volume is removed
> --
>
> Key: HDFS-17488
> URL: https://issues.apache.org/jira/browse/HDFS-17488
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Felix N
>Assignee: Felix N
>Priority: Major
>  Labels: pull-request-available
>
>  
> Error logs
> {code:java}
> 2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 
> heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode 
> (BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool 
> BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid 
> 1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246)
>     at 
> org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
>     at java.lang.Thread.run(Thread.java:748) {code}
> The root cause is in BPOfferService#notifyNamenodeBlock; it happens when it 
> is called on a block belonging to a volume that was already removed. Because 
> the volume was already removed:
>  
> {code:java}
> private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
> String delHint, String storageUuid, boolean isOnTransientStorage) {
>   checkBlock(block);
>   final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
>   block.getLocalBlock(), status, delHint);
>   final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
>   
>   // storage == null here because it's already removed earlier.
>   for (BPServiceActor actor : bpServices) {
> actor.getIbrManager().notifyNamenodeBlock(info, storage,
> isOnTransientStorage);
>   }
> } {code}
> so IBRs with a null storage are now pending.
> The reason why notifyNamenodeBlock can trigger on such blocks is up in 
> DirectoryScanner#reconcile
> {code:java}
>   public void reconcile() throws IOException {
>     LOG.debug("reconcile start DirectoryScanning");
>     scan();
> // If a volume is removed here after scan() already finished running,
> // diffs is stale and checkAndUpdate will run on a removed volume
>     // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too
>     // long
>     int loopCount = 0;
>     synchronized (diffs) {
>       for (final Map.Entry entry : diffs.getEntries()) {
>         dataset.checkAndUpdate(entry.getKey(), entry.getValue());        
>     ...
>   } {code}
> Inside checkAndUpdate, memBlockInfo is null because all the block meta in 
> memory is removed during the volume removal, but diskFile still exists. Then 
> DataNode#notifyNamenodeDeletedBlock (and further down the line, 
> notifyNamenodeBlock) is called on this block.
>  
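
A hedged, self-contained sketch of the kind of guard that would avoid the NPE 
(names are hypothetical stand-ins, not the actual fix in the PR): skip the 
notification when the storage lookup returns null, so a null DatanodeStorage 
is never queued for an IBR.

{code:java}
import java.util.concurrent.ConcurrentHashMap;

public class IbrGuardDemo {
  static final ConcurrentHashMap<String, String> storageMap = new ConcurrentHashMap<>();

  static void notifyBlock(String storageUuid, String blockId) {
    String storage = storageMap.get(storageUuid);
    if (storage == null) {
      // The volume was removed between the scan and this call: drop the
      // report instead of queueing an IBR with a null storage.
      System.out.println("skip IBR for " + blockId + ": storage " + storageUuid + " gone");
      return;
    }
    System.out.println("queue IBR for " + blockId + " on " + storage);
  }

  public static void main(String[] args) {
    storageMap.put("DS-1", "/data1");
    notifyBlock("DS-1", "blk_1");   // queued
    storageMap.remove("DS-1");      // volume removal
    notifyBlock("DS-1", "blk_1");   // skipped, no NPE
  }
}
{code}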



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Comment Edited] (HDFS-17488) DN can fail IBRs with NPE when a volume is removed

2024-04-22 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839543#comment-17839543
 ] 

farmmamba edited comment on HDFS-17488 at 4/22/24 8:15 AM:
---

[~coconut_icecream] Sir, thanks for your report; I think it is a duplicate. 
Please refer to HDFS-17467.


was (Author: zhanghaobo):
[~coconut_icecream] Sir, it is a duplicate. Please refer to HDFS-17467.

> DN can fail IBRs with NPE when a volume is removed
> --
>
> Key: HDFS-17488
> URL: https://issues.apache.org/jira/browse/HDFS-17488
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Felix N
>Assignee: Felix N
>Priority: Major
>
>  
> Error logs
> {code:java}
> 2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 
> heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode 
> (BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool 
> BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid 
> 1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246)
>     at 
> org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
>     at java.lang.Thread.run(Thread.java:748) {code}
> The root cause is in BPOfferService#notifyNamenodeBlock; it happens when it 
> is called on a block belonging to a volume that was already removed. Because 
> the volume was already removed:
>  
> {code:java}
> private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
> String delHint, String storageUuid, boolean isOnTransientStorage) {
>   checkBlock(block);
>   final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
>   block.getLocalBlock(), status, delHint);
>   final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
>   
>   // storage == null here because it's already removed earlier.
>   for (BPServiceActor actor : bpServices) {
> actor.getIbrManager().notifyNamenodeBlock(info, storage,
> isOnTransientStorage);
>   }
> } {code}
> so IBRs with a null storage are now pending.
> The reason why notifyNamenodeBlock can trigger on such blocks is up in 
> DirectoryScanner#reconcile
> {code:java}
>   public void reconcile() throws IOException {
>     LOG.debug("reconcile start DirectoryScanning");
>     scan();
> // If a volume is removed here after scan() already finished running,
> // diffs is stale and checkAndUpdate will run on a removed volume
>     // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too
>     // long
>     int loopCount = 0;
>     synchronized (diffs) {
>       for (final Map.Entry entry : diffs.getEntries()) {
>         dataset.checkAndUpdate(entry.getKey(), entry.getValue());        
>     ...
>   } {code}
> Inside checkAndUpdate, memBlockInfo is null because all the block meta in 
> memory is removed during the volume removal, but diskFile still exists. Then 
> DataNode#notifyNamenodeDeletedBlock (and further down the line, 
> notifyNamenodeBlock) is called on this block.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (HDFS-17488) DN can fail IBRs with NPE when a volume is removed

2024-04-22 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839543#comment-17839543
 ] 

farmmamba commented on HDFS-17488:
--

[~coconut_icecream] Sir, it is a duplicate. Please refer to HDFS-17467.

> DN can fail IBRs with NPE when a volume is removed
> --
>
> Key: HDFS-17488
> URL: https://issues.apache.org/jira/browse/HDFS-17488
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Felix N
>Assignee: Felix N
>Priority: Major
>
>  
> Error logs
> {code:java}
> 2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 
> heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode 
> (BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool 
> BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid 
> 1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246)
>     at 
> org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
>     at java.lang.Thread.run(Thread.java:748) {code}
> The root cause is in BPOfferService#notifyNamenodeBlock; it happens when it 
> is called on a block belonging to a volume that was already removed. Because 
> the volume was already removed:
>  
> {code:java}
> private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
> String delHint, String storageUuid, boolean isOnTransientStorage) {
>   checkBlock(block);
>   final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
>   block.getLocalBlock(), status, delHint);
>   final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
>   
>   // storage == null here because it's already removed earlier.
>   for (BPServiceActor actor : bpServices) {
> actor.getIbrManager().notifyNamenodeBlock(info, storage,
> isOnTransientStorage);
>   }
> } {code}
> so IBRs with a null storage are now pending.
> The reason why notifyNamenodeBlock can trigger on such blocks is up in 
> DirectoryScanner#reconcile
> {code:java}
>   public void reconcile() throws IOException {
>     LOG.debug("reconcile start DirectoryScanning");
>     scan();
> // If a volume is removed here after scan() already finished running,
> // diffs is stale and checkAndUpdate will run on a removed volume
>     // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too
>     // long
>     int loopCount = 0;
>     synchronized (diffs) {
>       for (final Map.Entry entry : diffs.getEntries()) {
>         dataset.checkAndUpdate(entry.getKey(), entry.getValue());        
>     ...
>   } {code}
> Inside checkAndUpdate, memBlockInfo is null because all the block meta in 
> memory is removed during the volume removal, but diskFile still exists. Then 
> DataNode#notifyNamenodeDeletedBlock (and further down the line, 
> notifyNamenodeBlock) is called on this block.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (HDFS-17484) Introduce redundancy.considerLoad.minLoad to avoiding excluding nodes when they are not busy actually

2024-04-22 Thread farmmamba (Jira)
farmmamba created HDFS-17484:


 Summary: Introduce redundancy.considerLoad.minLoad to avoiding 
excluding nodes when they are not busy actually
 Key: HDFS-17484
 URL: https://issues.apache.org/jira/browse/HDFS-17484
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba


Currently, `dfs.namenode.redundancy.considerLoad` equals true by default, and 
dfs.namenode.redundancy.considerLoad.factor equals 2.0 by default.

Think about the situation below. When we are doing a stress test, we may 
deploy the hdfs client onto a datanode, so this hdfs client will prefer to 
write to its local datanode and increase that machine's load. Suppose we have 
3 datanodes whose loads are as below: 5.0, 0.2, 0.3.

The datanode whose load equals 5.0 will be excluded when choosing datanodes 
for a block. But actually, a load of 5.0 does not make a machine with 80 CPU 
cores a slow node.

So, we should add a new configuration entry, 
`dfs.namenode.redundancy.considerLoad.minLoad`, to indicate the minimum load 
at which considerLoad takes effect.
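
A minimal sketch of the intended check (the method shape is my illustration, 
not the actual patch): a node is excluded for load only when its load exceeds 
both the existing relative factor and the proposed absolute floor.

{code:java}
public class ConsiderLoadCheck {
  // Illustrative only: combine the existing considerLoad.factor test with
  // the proposed dfs.namenode.redundancy.considerLoad.minLoad floor.
  static boolean excludeForLoad(double nodeLoad, double avgLoad,
                                double factor, double minLoad) {
    return nodeLoad > minLoad && nodeLoad > avgLoad * factor;
  }

  public static void main(String[] args) {
    double avg = (5.0 + 0.2 + 0.3) / 3;  // the example loads above
    // Excluded today (5.0 > 2.0 * avg), but kept with a minLoad floor:
    System.out.println(excludeForLoad(5.0, avg, 2.0, 0.0));   // true
    System.out.println(excludeForLoad(5.0, avg, 2.0, 10.0));  // false
  }
}
{code}

With the example loads above (average about 1.83, factor 2.0), the first node 
is excluded today; with minLoad set to, say, 10.0 for an 80-core machine, it 
would no longer be.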



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (HDFS-17466) Move FsVolumeList#getVolumes() invocation out of DataSetLock

2024-04-15 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17466:
-
Summary: Move FsVolumeList#getVolumes() invocation out of DataSetLock  
(was: Remove FsVolumeList#getVolumes() invocation out of DataSetLock)

> Move FsVolumeList#getVolumes() invocation out of DataSetLock
> 
>
> Key: HDFS-17466
> URL: https://issues.apache.org/jira/browse/HDFS-17466
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (HDFS-17470) FsVolumeList#getNextVolume can be moved out of DataSetLock

2024-04-15 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17470:
-
Issue Type: Improvement  (was: Task)

> FsVolumeList#getNextVolume can be moved out of DataSetLock
> --
>
> Key: HDFS-17470
> URL: https://issues.apache.org/jira/browse/HDFS-17470
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> FsVolumeList#getNextVolume can be moved out of the BLOCK_POOL read lock.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (HDFS-17470) FsVolumeList#getNextVolume can be moved out of DataSetLock

2024-04-15 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17470:
-
Parent: HDFS-15382
Issue Type: Sub-task  (was: Improvement)

> FsVolumeList#getNextVolume can be moved out of DataSetLock
> --
>
> Key: HDFS-17470
> URL: https://issues.apache.org/jira/browse/HDFS-17470
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> FsVolumeList#getNextVolume can be moved out of the BLOCK_POOL read lock.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (HDFS-17470) FsVolumeList#getNextVolume can be moved out of DataSetLock

2024-04-15 Thread farmmamba (Jira)
farmmamba created HDFS-17470:


 Summary: FsVolumeList#getNextVolume can be moved out of DataSetLock
 Key: HDFS-17470
 URL: https://issues.apache.org/jira/browse/HDFS-17470
 Project: Hadoop HDFS
  Issue Type: Task
  Components: datanode
Reporter: farmmamba
Assignee: farmmamba


FsVolumeList#getNextVolume can be moved out of the BLOCK_POOL read lock.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (HDFS-17467) IncrementalBlockReportManager#getPerStorageIBR may throw NPE when remove volumes

2024-04-15 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837114#comment-17837114
 ] 

farmmamba commented on HDFS-17467:
--

[~hexiaoqiao] [~zhangshuyan] [~ayushsaxena] [~tomscut] Sir, could you please 
check this problem when you are free? Thanks a lot.

> IncrementalBlockReportManager#getPerStorageIBR may throw NPE when remove 
> volumes
> 
>
> Key: HDFS-17467
> URL: https://issues.apache.org/jira/browse/HDFS-17467
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> When we remove volumes, it may cause 
> IncrementalBlockReportManager#getPerStorageIBR to throw an NPE.
> Consider the situation below:
> 1. We have done createRbw and finalizeBlock, but have not yet done 
> datanode.closeBlock in the method `BlockReceiver.PacketResponder#finalizeBlock`.
> 2. We remove the volume the replica was written to, which executes 
> `storageMap.remove(storageUuid);`.
> 3. We begin to execute datanode.closeBlock, which tries to send an IBR to 
> the NameNode. But when getting the DatanodeStorage from storageMap using the 
> storageUuid, we get null because we have removed this storageUuid key 
> from storageMap.
> 4. An NPE is thrown in the getPerStorageIBR method, because 
> ConcurrentHashMap doesn't allow null keys.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (HDFS-17467) IncrementalBlockReportManager#getPerStorageIBR may throw NPE when remove volumes

2024-04-15 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17467:
-
Description: 
When we remove volumes, it may cause 
IncrementalBlockReportManager#getPerStorageIBR to throw an NPE.

Consider the situation below:

1. We have done createRbw and finalizeBlock, but have not yet done 
datanode.closeBlock in the method `BlockReceiver.PacketResponder#finalizeBlock`.

2. We remove the volume the replica was written to, which executes 
`storageMap.remove(storageUuid);`.

3. We begin to execute datanode.closeBlock, which tries to send an IBR to the 
NameNode. But when getting the DatanodeStorage from storageMap using the 
storageUuid, we get null because we have removed this storageUuid key from 
storageMap.

4. An NPE is thrown in the getPerStorageIBR method, because ConcurrentHashMap 
doesn't allow null keys.
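
To make step 4 concrete, a standalone illustration in plain Java (not HDFS 
code) of why a null storage key fails:

{code:java}
import java.util.concurrent.ConcurrentHashMap;

public class NullKeyDemo {
  public static void main(String[] args) {
    ConcurrentHashMap<String, String> perStorage = new ConcurrentHashMap<>();
    // What the storageMap lookup effectively returns after the volume removal:
    String storage = null;
    // ConcurrentHashMap forbids null keys, so this throws NullPointerException,
    // matching the NPE seen in getPerStorageIBR.
    perStorage.put(storage, "pending-IBR");
  }
}
{code}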

 

 

 

  was:
When we remove volumes, it may causeConsider below situation:

 


> IncrementalBlockReportManager#getPerStorageIBR may throw NPE when remove 
> volumes
> 
>
> Key: HDFS-17467
> URL: https://issues.apache.org/jira/browse/HDFS-17467
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> When we remove volumes, it may cause 
> IncrementalBlockReportManager#getPerStorageIBR to throw an NPE.
> Consider the situation below:
> 1. We have done createRbw and finalizeBlock, but have not yet done 
> datanode.closeBlock in the method `BlockReceiver.PacketResponder#finalizeBlock`.
> 2. We remove the volume the replica was written to, which executes 
> `storageMap.remove(storageUuid);`.
> 3. We begin to execute datanode.closeBlock, which tries to send an IBR to 
> the NameNode. But when getting the DatanodeStorage from storageMap using the 
> storageUuid, we get null because we have removed this storageUuid key 
> from storageMap.
> 4. An NPE is thrown in the getPerStorageIBR method, because 
> ConcurrentHashMap doesn't allow null keys.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (HDFS-17467) IncrementalBlockReportManager#getPerStorageIBR may throw NPE when remove volumes

2024-04-15 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17467:
-
Description: 
When we remove volumes, it may causeConsider below situation:

 

> IncrementalBlockReportManager#getPerStorageIBR may throw NPE when remove 
> volumes
> 
>
> Key: HDFS-17467
> URL: https://issues.apache.org/jira/browse/HDFS-17467
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> When we remove volumes, it may causeConsider below situation:
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (HDFS-17467) IncrementalBlockReportManager#getPerStorageIBR may throw NPE when remove volumes

2024-04-15 Thread farmmamba (Jira)
farmmamba created HDFS-17467:


 Summary: IncrementalBlockReportManager#getPerStorageIBR may throw 
NPE when removing volumes
 Key: HDFS-17467
 URL: https://issues.apache.org/jira/browse/HDFS-17467
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17466) Remove FsVolumeList#getVolumes() invocation out of DataSetLock

2024-04-14 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17466:
-
Parent: HDFS-15382
Issue Type: Sub-task  (was: Improvement)

> Remove FsVolumeList#getVolumes() invocation out of DataSetLock
> --
>
> Key: HDFS-17466
> URL: https://issues.apache.org/jira/browse/HDFS-17466
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17466) Remove FsVolumeList#getVolumes() invocation out of DataSetLock

2024-04-14 Thread farmmamba (Jira)
farmmamba created HDFS-17466:


 Summary: Remove FsVolumeList#getVolumes() invocation out of 
DataSetLock
 Key: HDFS-17466
 URL: https://issues.apache.org/jira/browse/HDFS-17466
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17464) Improve some logs output in class FsDatasetImpl

2024-04-12 Thread farmmamba (Jira)
farmmamba created HDFS-17464:


 Summary: Improve some logs output in class FsDatasetImpl
 Key: HDFS-17464
 URL: https://issues.apache.org/jira/browse/HDFS-17464
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17458) Remove unnecessary BP lock in ReplicaMap

2024-04-09 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17458:
-
Description: 
In HDFS-16429 we made LightWeightResizableGSet thread safe, and in 
HDFS-16511 we changed some methods in ReplicaMap to acquire the read lock 
instead of the write lock.

This PR tries to further remove unnecessary Block_Pool read locks.

Recently, I performed stress tests on datanodes to measure their read/write 
operations per second.

Before removing these locks, a datanode could only achieve ~2K write ops. 
After optimizing, it can achieve more than 5K write ops.
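
For context, a minimal sketch of the read/write-lock pattern these changes rely 
on (names are illustrative, not the real ReplicaMap): because the underlying 
GSet is already thread safe, lookups can share the read lock and only 
structural mutations need the exclusive write lock.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReplicaMapSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<Long, String> replicas = new HashMap<>();

  String get(long blockId) {
    lock.readLock().lock();   // shared: concurrent readers do not block each other
    try {
      return replicas.get(blockId);
    } finally {
      lock.readLock().unlock();
    }
  }

  void add(long blockId, String replica) {
    lock.writeLock().lock();  // exclusive: mutations are serialized
    try {
      replicas.put(blockId, replica);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}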

> Remove unnecessary BP lock in ReplicaMap
> 
>
> Key: HDFS-17458
> URL: https://issues.apache.org/jira/browse/HDFS-17458
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> In HDFS-16429 we made LightWeightResizableGSet thread safe, and in 
> HDFS-16511 we changed some methods in ReplicaMap to acquire the read lock 
> instead of the write lock.
> This PR tries to further remove unnecessary Block_Pool read locks.
> Recently, I performed stress tests on datanodes to measure their read/write 
> operations per second.
> Before removing these locks, a datanode could only achieve ~2K write ops. 
> After optimizing, it can achieve more than 5K write ops.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17458) Remove unnecessary BP lock in ReplicaMap

2024-04-09 Thread farmmamba (Jira)
farmmamba created HDFS-17458:


 Summary: Remove unnecessary BP lock in ReplicaMap
 Key: HDFS-17458
 URL: https://issues.apache.org/jira/browse/HDFS-17458
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead

2024-03-20 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829366#comment-17829366
 ] 

farmmamba commented on HDFS-17434:
--

[~qinyuren] Hi, could you please show your createRbw avgTime?

> Selector.select in SocketIOWithTimeout.java has significant overhead
> 
>
> Key: HDFS-17434
> URL: https://issues.apache.org/jira/browse/HDFS-17434
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: qinyuren
>Priority: Major
> Attachments: image-2024-03-20-19-10-13-016.png, 
> image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png, 
> image-2024-03-20-19-55-18-378.png
>
>
> In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges 
> from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine 
> network card bandwidth is 2Mb/s.
> !image-2024-03-20-19-10-13-016.png|width=662,height=135!
> !image-2024-03-20-19-55-18-378.png!
> By adding log printing, it turns out that the Selector.select function has 
> significant overhead.
> !image-2024-03-20-19-22-29-829.png|width=474,height=262!
> !image-2024-03-20-19-24-02-233.png|width=445,height=181!
> I would like to know if this falls within the normal range or how we can 
> improve it.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17433) metric sumOfActorCommandQueueLength should only record valid commands

2024-03-19 Thread farmmamba (Jira)
farmmamba created HDFS-17433:


 Summary: metric sumOfActorCommandQueueLength should only record 
valid commands
 Key: HDFS-17433
 URL: https://issues.apache.org/jira/browse/HDFS-17433
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17374) EC: StripedBlockReader#newConnectedPeer should set SO_TIMEOUT and SO_KEEPALIVE

2024-02-07 Thread farmmamba (Jira)
farmmamba created HDFS-17374:


 Summary: EC: StripedBlockReader#newConnectedPeer should set 
SO_TIMEOUT and SO_KEEPALIVE
 Key: HDFS-17374
 URL: https://issues.apache.org/jira/browse/HDFS-17374
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ec
Reporter: farmmamba
Assignee: farmmamba
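

A sketch of the two socket options named in the title (host, port and timeout 
values are assumptions for illustration only): without SO_TIMEOUT a read from 
a dead remote datanode can block an EC reconstruction thread indefinitely.
{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PeerSocketSketch {
  public static Socket newConnectedSocket() throws IOException {
    Socket sock = new Socket();
    sock.connect(new InetSocketAddress("datanode-host", 9866), 3000);
    sock.setSoTimeout(60_000); // SO_TIMEOUT: bound blocking reads
    sock.setKeepAlive(true);   // SO_KEEPALIVE: detect dead peers
    return sock;
  }
}
{code}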






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17372) CommandProcessingThread#queue should use LinkedBlockingDeque to prevent high priority command blocked by low priority command

2024-02-05 Thread farmmamba (Jira)
farmmamba created HDFS-17372:


 Summary: CommandProcessingThread#queue should use 
LinkedBlockingDeque to prevent high priority command blocked by low priority 
command
 Key: HDFS-17372
 URL: https://issues.apache.org/jira/browse/HDFS-17372
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: farmmamba
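

A minimal sketch of the queueing idea in the title (the command type is 
illustrative): with a LinkedBlockingDeque, high-priority commands can be 
inserted at the head so they are consumed before already-queued low-priority 
ones.
{code:java}
import java.util.concurrent.LinkedBlockingDeque;

class CommandQueueSketch {
  private final LinkedBlockingDeque<String> queue = new LinkedBlockingDeque<>();

  void enqueue(String command, boolean highPriority) throws InterruptedException {
    if (highPriority) {
      queue.putFirst(command); // jumps ahead of queued low-priority commands
    } else {
      queue.putLast(command);
    }
  }

  String next() throws InterruptedException {
    return queue.takeFirst();
  }
}
{code}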






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17372) CommandProcessingThread#queue should use LinkedBlockingDeque to prevent high priority command blocked by low priority command

2024-02-05 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17372:


Assignee: farmmamba

> CommandProcessingThread#queue should use LinkedBlockingDeque to prevent high 
> priority command blocked by low priority command
> -
>
> Key: HDFS-17372
> URL: https://issues.apache.org/jira/browse/HDFS-17372
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17365) EC: Add extra redundancy configuration in checkStreamerFailures to prevent data loss.

2024-01-31 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17365:
-
Summary: EC: Add extra redundancy configuration in checkStreamerFailures to 
prevent data loss.  (was: Add extra redundancy configuration in 
checkStreamerFailures to prevent data loss.)

> EC: Add extra redundancy configuration in checkStreamerFailures to prevent 
> data loss.
> 
>
> Key: HDFS-17365
> URL: https://issues.apache.org/jira/browse/HDFS-17365
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ec
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17365) Add extra redundancy configuration in checkStreamerFailures to prevent data loss.

2024-01-31 Thread farmmamba (Jira)
farmmamba created HDFS-17365:


 Summary: Add extra redundancy configuration in 
checkStreamerFailures to prevent data loss.
 Key: HDFS-17365
 URL: https://issues.apache.org/jira/browse/HDFS-17365
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ec
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17348) Enhance Log when checkLocations in RecoveryTaskStriped

2024-01-30 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812526#comment-17812526
 ] 

farmmamba commented on HDFS-17348:
--

[~tasanuma] Sir, thanks for what you have done~

> Enhance Log when checkLocations in RecoveryTaskStriped
> --
>
> Key: HDFS-17348
> URL: https://issues.apache.org/jira/browse/HDFS-17348
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Trivial
>  Labels: pull-request-available
>
> Enhance the IOE log for better debugging.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17348) Enhance Log when checkLocations in RecoveryTaskStriped

2024-01-30 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba resolved HDFS-17348.
--
Resolution: Resolved

Move changes to HDFS-17358

> Enhance Log when checkLocations in RecoveryTaskStriped
> --
>
> Key: HDFS-17348
> URL: https://issues.apache.org/jira/browse/HDFS-17348
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Trivial
>  Labels: pull-request-available
>
> Enhance the IOE log for better debugging.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17363) Avoid initializing unnecessary objects in method getBlockRecoveryCommand

2024-01-29 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba resolved HDFS-17363.
--
Resolution: Not A Problem

> Avoid initializing unnecessary objects in method getBlockRecoveryCommand
> 
>
> Key: HDFS-17363
> URL: https://issues.apache.org/jira/browse/HDFS-17363
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> In method getBlockRecoveryCommand, we have below codes:
> {code:java}
>else {
>         rBlock = new RecoveringBlock(primaryBlock, recoveryInfos,
>             uc.getBlockRecoveryId());
>         if (b.isStriped()) {
>           rBlock = new RecoveringStripedBlock(rBlock,
>               uc.getBlockIndicesForSpecifiedStorages(storageIdx),
>               ((BlockInfoStriped) b).getErasureCodingPolicy());
>         } {code}
> It seems that we initialize a RecoveringBlock object every time, even when 
> b.isStriped() returns true.
> This is unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17363) Avoid initializing unnecessary objects in method getBlockRecoveryCommand

2024-01-29 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17363:
-
Description: 
In method getBlockRecoveryCommand, we have below codes:
{code:java}
   else {
        rBlock = new RecoveringBlock(primaryBlock, recoveryInfos,
            uc.getBlockRecoveryId());
        if (b.isStriped()) {
          rBlock = new RecoveringStripedBlock(rBlock,
              uc.getBlockIndicesForSpecifiedStorages(storageIdx),
              ((BlockInfoStriped) b).getErasureCodingPolicy());
        } {code}
It seems that we initialize a RecoveringBlock object every time, even when 
b.isStriped() returns true.

This is unnecessary.

  was:
In method getBlockRecoveryCommand, we have below codes:
{code:java}
   else {
        rBlock = new RecoveringBlock(primaryBlock, recoveryInfos,
            uc.getBlockRecoveryId());
        if (b.isStriped()) {
          rBlock = new RecoveringStripedBlock(rBlock,
              uc.getBlockIndicesForSpecifiedStorages(storageIdx),
              ((BlockInfoStriped) b).getErasureCodingPolicy());
        } {code}
It seems that we initialize RecoveringBlock object every time even though 
b.isStriped returns true.

This is 


> Avoid initializing unnecessary objects in method getBlockRecoveryCommand
> 
>
> Key: HDFS-17363
> URL: https://issues.apache.org/jira/browse/HDFS-17363
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> In method getBlockRecoveryCommand, we have below codes:
> {code:java}
>else {
>         rBlock = new RecoveringBlock(primaryBlock, recoveryInfos,
>             uc.getBlockRecoveryId());
>         if (b.isStriped()) {
>           rBlock = new RecoveringStripedBlock(rBlock,
>               uc.getBlockIndicesForSpecifiedStorages(storageIdx),
>               ((BlockInfoStriped) b).getErasureCodingPolicy());
>         } {code}
> It seems that we initialize a RecoveringBlock object every time, even when 
> b.isStriped() returns true.
> This is unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17363) Avoid initializing unnecessary objects in method getBlockRecoveryCommand

2024-01-29 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17363:
-
Description: 
In method getBlockRecoveryCommand, we have below codes:
{code:java}
   else {
        rBlock = new RecoveringBlock(primaryBlock, recoveryInfos,
            uc.getBlockRecoveryId());
        if (b.isStriped()) {
          rBlock = new RecoveringStripedBlock(rBlock,
              uc.getBlockIndicesForSpecifiedStorages(storageIdx),
              ((BlockInfoStriped) b).getErasureCodingPolicy());
        } {code}
It seems that we initialize RecoveringBlock object every time even though 
b.isStriped returns true.

This is 

  was:In method 


> Avoid initializing unnecessary objects in method getBlockRecoveryCommand
> 
>
> Key: HDFS-17363
> URL: https://issues.apache.org/jira/browse/HDFS-17363
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> In method getBlockRecoveryCommand, we have below codes:
> {code:java}
>else {
>         rBlock = new RecoveringBlock(primaryBlock, recoveryInfos,
>             uc.getBlockRecoveryId());
>         if (b.isStriped()) {
>           rBlock = new RecoveringStripedBlock(rBlock,
>               uc.getBlockIndicesForSpecifiedStorages(storageIdx),
>               ((BlockInfoStriped) b).getErasureCodingPolicy());
>         } {code}
> It seems that we initialize RecoveringBlock object every time even though 
> b.isStriped returns true.
> This is 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17363) Avoid initializing unnecessary objects in method getBlockRecoveryCommand

2024-01-29 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17363:
-
Description: In method 

> Avoid initializing unnecessary objects in method getBlockRecoveryCommand
> 
>
> Key: HDFS-17363
> URL: https://issues.apache.org/jira/browse/HDFS-17363
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> In method 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17363) Avoid initializing unnecessary objects in method getBlockRecoveryCommand

2024-01-29 Thread farmmamba (Jira)
farmmamba created HDFS-17363:


 Summary: Avoid initializing unnecessary objects in method 
getBlockRecoveryCommand
 Key: HDFS-17363
 URL: https://issues.apache.org/jira/browse/HDFS-17363
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17358) EC: infinite lease recovery caused by the length of RWR being equal to zero.

2024-01-28 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17358:
-
Description: 
Recently, a strange case happened on our EC production cluster.

The phenomenon is as described below: the NameNode does infinite lease 
recovery of some EC files (~80K+) and those files could never be closed.

 

After digging into logs and related code, we found the root cause is the below 
code in method `BlockRecoveryWorker$RecoveryTaskStriped#recover`:
{code:java}
          // we met info.getNumBytes==0 here! 
  if (info != null &&
              info.getGenerationStamp() >= block.getGenerationStamp() &&
              info.getNumBytes() > 0) {
            final BlockRecord existing = syncBlocks.get(blockId);
            if (existing == null ||
                info.getNumBytes() > existing.rInfo.getNumBytes()) {
              // if we have >1 replicas for the same internal block, we
              // simply choose the one with larger length.
              // TODO: better usage of redundant replicas
              syncBlocks.put(blockId, new BlockRecord(id, proxyDN, info));
            }
          }

  // throw exception here!
          checkLocations(syncBlocks.size());


{code}
The related logs are as below:
{code:java}
java.io.IOException: 
BP-1157541496-10.104.10.198-1702548776421:blk_-9223372036808032688_2938828 has 
no enough internal blocks, unable to start recovery. Locations=[...] {code}
{code:java}
2024-01-23 12:48:16,171 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
initReplicaRecovery: blk_-9223372036808032686_2938828, recoveryId=27615365, 
replica=ReplicaUnderRecovery, blk_-9223372036808032686_2938828, RUR 
getNumBytes() = 0 getBytesOnDisk() = 0 getVisibleLength()= -1 getVolume() = 
/data25/hadoop/hdfs/datanode getBlockURI() = 
file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-x.x.x.x-1702548776421/current/rbw/blk_-9223372036808032686
 recoveryId=27529675 original=ReplicaWaitingToBeRecovered, 
blk_-9223372036808032686_2938828, RWR getNumBytes() = 0 getBytesOnDisk() = 0 
getVisibleLength()= -1 getVolume() = /data25/hadoop/hdfs/datanode getBlockURI() 
= 
file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-10.104.10.198-1702548776421/current/rbw/blk_-9223372036808032686
{code}
Because the length of the RWR replica is zero, the ReplicaRecoveryInfo 
returned by the code below also has length zero, so we can't put it into 
syncBlocks.

checkLocations therefore throws the exception.
{code:java}
          ReplicaRecoveryInfo info = callInitReplicaRecovery(proxyDN,
              new RecoveringBlock(internalBlk, null, recoveryId)); {code}

  was:
Recently, there is a strange case happened on our ec production cluster.

The phenomenon is as below described: NameNode does inrecovery lease of


> EC: infinite lease recovery caused by the length of RWR being equal to zero.
> ---
>
> Key: HDFS-17358
> URL: https://issues.apache.org/jira/browse/HDFS-17358
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> Recently, a strange case happened on our EC production cluster.
> The phenomenon is as described below: the NameNode does infinite lease 
> recovery of some EC files (~80K+) and those files could never be closed.
>  
> After digging into logs and related code, we found the root cause is the 
> below code in method `BlockRecoveryWorker$RecoveryTaskStriped#recover`:
> {code:java}
>           // we met info.getNumBytes==0 here! 
>   if (info != null &&
>               info.getGenerationStamp() >= block.getGenerationStamp() &&
>               info.getNumBytes() > 0) {
>             final BlockRecord existing = syncBlocks.get(blockId);
>             if (existing == null ||
>                 info.getNumBytes() > existing.rInfo.getNumBytes()) {
>               // if we have >1 replicas for the same internal block, we
>               // simply choose the one with larger length.
>               // TODO: better usage of redundant replicas
>               syncBlocks.put(blockId, new BlockRecord(id, proxyDN, info));
>             }
>           }
>   // throw exception here!
>           checkLocations(syncBlocks.size());
> {code}
> The related logs are as below:
> {code:java}
> java.io.IOException: 
> BP-1157541496-10.104.10.198-1702548776421:blk_-9223372036808032688_2938828 
> has no enough internal blocks, unable to start recovery. Locations=[...] 
> {code}
> {code:java}
> 2024-01-23 12:48:16,171 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> initReplicaRecovery: blk_-9223372036808032686_2938828, recoveryId=27615365, 
> replica=ReplicaUnderRecovery, blk_-9223372036808032686_2938828, RUR 
> getNumBytes() = 0 

[jira] [Created] (HDFS-17359) EC: recheck failed streamers should happen only after flushing all packets.

2024-01-26 Thread farmmamba (Jira)
farmmamba created HDFS-17359:


 Summary: EC: recheck failed streamers should happen only after flushing 
all packets.
 Key: HDFS-17359
 URL: https://issues.apache.org/jira/browse/HDFS-17359
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ec
Reporter: farmmamba
Assignee: farmmamba


In method DFSStripedOutputStream#checkStreamerFailures, we have the below code:
{code:java}
    Set<StripedDataStreamer> newFailed = checkStreamers();
    if (newFailed.size() == 0) {
      return;
    }
    if (isNeedFlushAllPackets) {
      // for healthy streamers, wait till all of them have fetched the new block
      // and flushed out all the enqueued packets.
      flushAllInternals();
    }
    // recheck failed streamers again after the flush
    newFailed = checkStreamers(); {code}
We had better move the re-check logic into the if condition to avoid a useless 
invocation, as sketched below.
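
A minimal sketch of that restructuring, assuming the surrounding 
DFSStripedOutputStream context (a fragment, not a complete method):
{code:java}
    Set<StripedDataStreamer> newFailed = checkStreamers();
    if (newFailed.size() == 0) {
      return;
    }
    if (isNeedFlushAllPackets) {
      // for healthy streamers, wait till all of them have fetched the new block
      // and flushed out all the enqueued packets.
      flushAllInternals();
      // only the flush can change streamer states, so re-check just here
      newFailed = checkStreamers();
    }
{code}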



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17358) EC: infinite lease recovery

2024-01-25 Thread farmmamba (Jira)
farmmamba created HDFS-17358:


 Summary: EC: infinite lease recovery
 Key: HDFS-17358
 URL: https://issues.apache.org/jira/browse/HDFS-17358
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ec
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17348) Enhance Log when checkLocations in RecoveryTaskStriped

2024-01-22 Thread farmmamba (Jira)
farmmamba created HDFS-17348:


 Summary: Enhance Log when checkLocations in RecoveryTaskStriped
 Key: HDFS-17348
 URL: https://issues.apache.org/jira/browse/HDFS-17348
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba


Enhance the IOE log for better debugging.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17347) EC: Limit recovery worker counts to prevent holding too many network connections.

2024-01-22 Thread farmmamba (Jira)
farmmamba created HDFS-17347:


 Summary: EC: Limit recovery worker counts to prevent holding too 
many network connections.
 Key: HDFS-17347
 URL: https://issues.apache.org/jira/browse/HDFS-17347
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17345) Add a metric to record block report generating cost time

2024-01-20 Thread farmmamba (Jira)
farmmamba created HDFS-17345:


 Summary: Add a metric to record block report generating cost time
 Key: HDFS-17345
 URL: https://issues.apache.org/jira/browse/HDFS-17345
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.5.0
Reporter: farmmamba
Assignee: farmmamba


Currently, we already have the block report send time recorded by the 
blockReports metric.

We had better add another metric to record the block report creation cost:
{code:java}
long brCreateCost = brSendStartTime - brCreateStartTime; {code}
It is useful for us to measure the performance of creating block reports.
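
A sketch of how such a metric could be wired up with the metrics2 MutableRate 
pattern used by the existing DataNode metrics (the metric name 
blockReportCreateCost is an assumption, not a committed name):
{code:java}
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableRate;

class BlockReportMetricsSketch {
  private final MetricsRegistry registry = new MetricsRegistry("datanode");
  private final MutableRate blockReportCreateCost =
      registry.newRate("blockReportCreateCost", "Time to generate a block report");

  void onBlockReportSent(long brCreateStartTime, long brSendStartTime) {
    // MutableRate exposes the avg time and ops count automatically
    blockReportCreateCost.add(brSendStartTime - brCreateStartTime);
  }
}
{code}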



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10224) Implement asynchronous rename for DistributedFileSystem

2024-01-19 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808845#comment-17808845
 ] 

farmmamba commented on HDFS-10224:
--

[~szetszwo] Sir, Thanks a lot ~

> Implement asynchronous rename for DistributedFileSystem
> ---
>
> Key: HDFS-10224
> URL: https://issues.apache.org/jira/browse/HDFS-10224
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: fs, hdfs-client
>Reporter: Xiaobing Zhou
>Assignee: Xiaobing Zhou
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HDFS-10224-HDFS-9924.000.patch, 
> HDFS-10224-HDFS-9924.001.patch, HDFS-10224-HDFS-9924.002.patch, 
> HDFS-10224-HDFS-9924.003.patch, HDFS-10224-HDFS-9924.004.patch, 
> HDFS-10224-HDFS-9924.005.patch, HDFS-10224-HDFS-9924.006.patch, 
> HDFS-10224-HDFS-9924.007.patch, HDFS-10224-HDFS-9924.008.patch, 
> HDFS-10224-HDFS-9924.009.patch, HDFS-10224-and-HADOOP-12909.000.patch, 
> image-2024-01-19-23-06-32-901.png
>
>
> This is proposed to implement an asynchronous rename.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10224) Implement asynchronous rename for DistributedFileSystem

2024-01-19 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-10224:
-
Attachment: image-2024-01-19-23-06-32-901.png

> Implement asynchronous rename for DistributedFileSystem
> ---
>
> Key: HDFS-10224
> URL: https://issues.apache.org/jira/browse/HDFS-10224
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: fs, hdfs-client
>Reporter: Xiaobing Zhou
>Assignee: Xiaobing Zhou
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HDFS-10224-HDFS-9924.000.patch, 
> HDFS-10224-HDFS-9924.001.patch, HDFS-10224-HDFS-9924.002.patch, 
> HDFS-10224-HDFS-9924.003.patch, HDFS-10224-HDFS-9924.004.patch, 
> HDFS-10224-HDFS-9924.005.patch, HDFS-10224-HDFS-9924.006.patch, 
> HDFS-10224-HDFS-9924.007.patch, HDFS-10224-HDFS-9924.008.patch, 
> HDFS-10224-HDFS-9924.009.patch, HDFS-10224-and-HADOOP-12909.000.patch, 
> image-2024-01-19-23-06-32-901.png
>
>
> This is proposed to implement an asynchronous rename.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10224) Implement asynchronous rename for DistributedFileSystem

2024-01-19 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808697#comment-17808697
 ] 

farmmamba commented on HDFS-10224:
--

[~szetszwo] [~xiaobingo] Hi, sir. Could you please tell me why I cannot find 
the class AsyncDistributedFileSystem in the current trunk branch? Thanks a lot.

 

!image-2024-01-19-23-06-32-901.png!

> Implement asynchronous rename for DistributedFileSystem
> ---
>
> Key: HDFS-10224
> URL: https://issues.apache.org/jira/browse/HDFS-10224
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: fs, hdfs-client
>Reporter: Xiaobing Zhou
>Assignee: Xiaobing Zhou
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HDFS-10224-HDFS-9924.000.patch, 
> HDFS-10224-HDFS-9924.001.patch, HDFS-10224-HDFS-9924.002.patch, 
> HDFS-10224-HDFS-9924.003.patch, HDFS-10224-HDFS-9924.004.patch, 
> HDFS-10224-HDFS-9924.005.patch, HDFS-10224-HDFS-9924.006.patch, 
> HDFS-10224-HDFS-9924.007.patch, HDFS-10224-HDFS-9924.008.patch, 
> HDFS-10224-HDFS-9924.009.patch, HDFS-10224-and-HADOOP-12909.000.patch
>
>
> This is proposed to implement an asynchronous rename.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17344) Last packet will be split into two parts when writing a block

2024-01-19 Thread farmmamba (Jira)
farmmamba created HDFS-17344:


 Summary: Last packet will be split into two parts when writing a 
block
 Key: HDFS-17344
 URL: https://issues.apache.org/jira/browse/HDFS-17344
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client
Affects Versions: 3.5.0
Reporter: farmmamba
Assignee: farmmamba


As mentioned in 
[https://github.com/apache/hadoop/pull/6368#issuecomment-1899635293]

This Jira tries to solve that problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-18 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808478#comment-17808478
 ] 

farmmamba commented on HDFS-17311:
--

Can use "git commit --allow-empty"


张浩博
hfutzhan...@163.com


 Replied Message 

[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808474#comment-17808474
 ]

ASF GitHub Bot commented on HDFS-17311:
---

LiuGuH commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1899821409

@LiuGuH Thanks for the contribution! Can we trigger compilation again?

Thanks for the review. Compilation is now triggered.
And I triggered compilation with the command "git commit --amend && git push -f". 
Is there any other way to trigger compilation? Thanks




RBF: ConnectionManager creatorQueue should offer a pool that is not already in 
creatorQueue.


Key: HDFS-17311
URL: https://issues.apache.org/jira/browse/HDFS-17311
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: liuguanghua
Assignee: liuguanghua
Priority: Major
Labels: pull-request-available

In the Router, we found the below log:
 
2023-12-29 15:18:54,799 ERROR 
org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
more than 2048 connections at the same time
 
The log indicates that ConnectionManager.creatorQueue is full at a certain 
point. But my cluster does not have so many users that it could reach 2048 
pairs of .
This may be due to the following reasons:
# ConnectionManager.creatorQueue is a queue that a ConnectionPool is offered to 
when it does not have enough ConnectionContexts.
# The ConnectionCreator thread consumes from creatorQueue and makes more 
ConnectionContexts for a ConnectionPool.
# Clients concurrently invoke ConnectionManager.getConnection() for the same 
user, which may add many copies of the same ConnectionPool into 
ConnectionManager.creatorQueue.
# When creatorQueue is full, a new ConnectionPool cannot be added successfully 
and this error is logged. As a result, a genuinely new ConnectionPool may be 
unable to produce more ConnectionContexts for a new user.
So this PR tries to ensure creatorQueue does not enqueue the same 
ConnectionPool more than once, as sketched below.
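
A minimal sketch of the deduplication idea (hypothetical names, not the actual 
ConnectionManager code): a pool is offered to the creator queue only if it is 
not already queued.
{code:java}
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

class CreatorQueueSketch<P> {
  private final BlockingQueue<P> creatorQueue = new ArrayBlockingQueue<>(2048);
  private final Set<P> queued = ConcurrentHashMap.newKeySet();

  boolean offerOnce(P pool) {
    if (!queued.add(pool)) {
      return false;            // already awaiting creation; skip the duplicate
    }
    if (!creatorQueue.offer(pool)) {
      queued.remove(pool);     // queue full; allow a later retry
      return false;
    }
    return true;
  }

  P take() throws InterruptedException {
    P pool = creatorQueue.take();
    queued.remove(pool);
    return pool;
  }
}
{code}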



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In the Router, we found the below log:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point. But my cluster does not have so many users that it could reach 2048 
> pairs of .
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue that a ConnectionPool is offered 
> to when it does not have enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and makes more 
> ConnectionContexts for a ConnectionPool.
>  # Clients concurrently invoke ConnectionManager.getConnection() for the same 
> user, which may add many copies of the same ConnectionPool into 
> ConnectionManager.creatorQueue.
>  # When creatorQueue is full, a new ConnectionPool cannot be added 
> successfully and this error is logged. As a result, a genuinely new 
> ConnectionPool may be unable to produce more ConnectionContexts for a new user.
> So this PR tries to ensure creatorQueue does not enqueue the same 
> ConnectionPool more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method

2024-01-18 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba resolved HDFS-17334.
--
Resolution: Not A Problem

> FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait 
> method
> ---
>
> Key: HDFS-17334
> URL: https://issues.apache.org/jira/browse/HDFS-17334
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In method FSEditLogAsync#enqueueEdit, the below code exists:
> {code:java}
> if (Thread.holdsLock(this)) {
>           // if queue is full, synchronized caller must immediately relinquish
>           // the monitor before re-offering to avoid deadlock with sync thread
>           // which needs the monitor to write transactions.
>           int permits = overflowMutex.drainPermits();
>           try {
>             do {
>               this.wait(1000); // will be notified by next logSync.
>             } while (!editPendingQ.offer(edit));
>           } finally {
>             overflowMutex.release(permits);
>           }
>         }  {code}
> It may invoke this.wait(1000) without holding this object's monitor.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17337) RPC RESPONSE time seems not exactly accurate when using FSEditLogAsync.

2024-01-11 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805470#comment-17805470
 ] 

farmmamba commented on HDFS-17337:
--

[~hexiaoqiao] [~zhangshuyan] [~tomscut] [~ayushsaxena] Sir, could you please 
take a look at this case when you are free?  Thanks a lot. 

> RPC RESPONSE time seems not exactly accurate when using FSEditLogAsync.
> ---
>
> Key: HDFS-17337
> URL: https://issues.apache.org/jira/browse/HDFS-17337
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> Currently, FSEditLogAsync is enabled by default. 
> We have the below code in method Server$RpcCall#run:
>  
> {code:java}
>       if (!isResponseDeferred()) {
>         long deltaNanos = Time.monotonicNowNanos() - startNanos;
>         ProcessingDetails details = getProcessingDetails();
>         details.set(Timing.PROCESSING, deltaNanos, TimeUnit.NANOSECONDS);
>         deltaNanos -= details.get(Timing.LOCKWAIT, TimeUnit.NANOSECONDS);
>         deltaNanos -= details.get(Timing.LOCKSHARED, TimeUnit.NANOSECONDS);
>         deltaNanos -= details.get(Timing.LOCKEXCLUSIVE, TimeUnit.NANOSECONDS);
>         details.set(Timing.LOCKFREE, deltaNanos, TimeUnit.NANOSECONDS);
>         startNanos = Time.monotonicNowNanos();
>         setResponseFields(value, responseParams);
>         sendResponse();
>         deltaNanos = Time.monotonicNowNanos() - startNanos;
>         details.set(Timing.RESPONSE, deltaNanos, TimeUnit.NANOSECONDS);
>       } else {
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Deferring response for callId: " + this.callId);
>         }
>       }{code}
> It computes the Timing.RESPONSE of an RpcCall using *Time.monotonicNowNanos() - 
> startNanos;*
> However, if we use async edit logging, we do not send the response here but in 
> FSEditLogAsync.RpcEdit#logSyncNotify.
> This causes the Timing.RESPONSE of an RpcCall to be not exactly accurate.
> {code:java}
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17337) RPC RESPONSE time seems not exactly accurate when using FSEditLogAsync.

2024-01-11 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17337:
-
Description: 
Currently, FSEditLogAsync is enabled by default. 

We have the below code in method Server$RpcCall#run:

 
{code:java}
      if (!isResponseDeferred()) {
        long deltaNanos = Time.monotonicNowNanos() - startNanos;
        ProcessingDetails details = getProcessingDetails();
        details.set(Timing.PROCESSING, deltaNanos, TimeUnit.NANOSECONDS);
        deltaNanos -= details.get(Timing.LOCKWAIT, TimeUnit.NANOSECONDS);
        deltaNanos -= details.get(Timing.LOCKSHARED, TimeUnit.NANOSECONDS);
        deltaNanos -= details.get(Timing.LOCKEXCLUSIVE, TimeUnit.NANOSECONDS);
        details.set(Timing.LOCKFREE, deltaNanos, TimeUnit.NANOSECONDS);
        startNanos = Time.monotonicNowNanos();
        setResponseFields(value, responseParams);
        sendResponse();
        deltaNanos = Time.monotonicNowNanos() - startNanos;
        details.set(Timing.RESPONSE, deltaNanos, TimeUnit.NANOSECONDS);
      } else {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Deferring response for callId: " + this.callId);
        }
      }{code}
It computes the Timing.RESPONSE of an RpcCall using *Time.monotonicNowNanos() - 
startNanos;*

However, if we use async edit logging, we do not send the response here but in 
FSEditLogAsync.RpcEdit#logSyncNotify.

This causes the Timing.RESPONSE of an RpcCall to be not exactly accurate. One 
possible direction is sketched after the snippet below.
{code:java}
    @Override
    public void logSyncNotify(RuntimeException syncEx) {
      try {
        if (syncEx == null) {
          call.sendResponse();
        } else {
          call.abortResponse(syncEx);
        }
      } catch (Exception e) {} // don't care if not sent.
    } {code}
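
One possible direction, only a sketch (it assumes the call's ProcessingDetails 
is still reachable and writable from FSEditLogAsync.RpcEdit at this point): 
measure the actual send inside logSyncNotify and record it as Timing.RESPONSE 
there.
{code:java}
    @Override
    public void logSyncNotify(RuntimeException syncEx) {
      long start = Time.monotonicNowNanos();
      try {
        if (syncEx == null) {
          call.sendResponse();
        } else {
          call.abortResponse(syncEx);
        }
        // attribute the deferred send to Timing.RESPONSE
        call.getProcessingDetails().set(Timing.RESPONSE,
            Time.monotonicNowNanos() - start, TimeUnit.NANOSECONDS);
      } catch (Exception e) {} // don't care if not sent.
    } {code}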
 

> RPC RESPONSE time seems not exactly accurate when using FSEditLogAsync.
> ---
>
> Key: HDFS-17337
> URL: https://issues.apache.org/jira/browse/HDFS-17337
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> Currently, FSEditLogAsync is enabled by default. 
> We have the below code in method Server$RpcCall#run:
>  
> {code:java}
>       if (!isResponseDeferred()) {
>         long deltaNanos = Time.monotonicNowNanos() - startNanos;
>         ProcessingDetails details = getProcessingDetails();
>         details.set(Timing.PROCESSING, deltaNanos, TimeUnit.NANOSECONDS);
>         deltaNanos -= details.get(Timing.LOCKWAIT, TimeUnit.NANOSECONDS);
>         deltaNanos -= details.get(Timing.LOCKSHARED, TimeUnit.NANOSECONDS);
>         deltaNanos -= details.get(Timing.LOCKEXCLUSIVE, TimeUnit.NANOSECONDS);
>         details.set(Timing.LOCKFREE, deltaNanos, TimeUnit.NANOSECONDS);
>         startNanos = Time.monotonicNowNanos();
>         setResponseFields(value, responseParams);
>         sendResponse();
>         deltaNanos = Time.monotonicNowNanos() - startNanos;
>         details.set(Timing.RESPONSE, deltaNanos, TimeUnit.NANOSECONDS);
>       } else {
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Deferring response for callId: " + this.callId);
>         }
>       }{code}
> It computes the Timing.RESPONSE of an RpcCall using *Time.monotonicNowNanos() - 
> startNanos;*
> However, if we use async edit logging, we do not send the response here but in 
> FSEditLogAsync.RpcEdit#logSyncNotify.
> This causes the Timing.RESPONSE of an RpcCall to be not exactly accurate.
> {code:java}
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17337) RPC RESPONSE time seems not exactly accurate when using FSEditLogAsync.

2024-01-11 Thread farmmamba (Jira)
farmmamba created HDFS-17337:


 Summary: RPC RESPONSE time seems not exactly accurate when using 
FSEditLogAsync.
 Key: HDFS-17337
 URL: https://issues.apache.org/jira/browse/HDFS-17337
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.3.6
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17335) Add metrics for syncWaitQ in FSEditLogAsync

2024-01-10 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17335:


Assignee: farmmamba

> Add metrics for syncWaitQ in FSEditLogAsync
> ---
>
> Key: HDFS-17335
> URL: https://issues.apache.org/jira/browse/HDFS-17335
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Minor
>
> To monitor syncWaitQ in FSEditLogAsync, we add a metric syncPendingCount.
> The reason we add this metric is that when dequeueEdit() returns null, the 
> boolean variable doSync is set to !syncWaitQ.isEmpty(). After adding this 
> metric we can better monitor the sync performance and the related code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17335) Add metrics for syncWaitQ in FSEditLogAsync

2024-01-10 Thread farmmamba (Jira)
farmmamba created HDFS-17335:


 Summary: Add metrics for syncWaitQ in FSEditLogAsync
 Key: HDFS-17335
 URL: https://issues.apache.org/jira/browse/HDFS-17335
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.3.6
Reporter: farmmamba


To monitor syncWaitQ in FSEditLogAsync, we add a metric syncPendingCount.

The reason we add this metric is that when dequeueEdit() returns null, the 
boolean variable doSync is set to !syncWaitQ.isEmpty(). After adding this 
metric we can better monitor the sync performance and the related code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method

2024-01-10 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805007#comment-17805007
 ] 

farmmamba commented on HDFS-17334:
--

[~hexiaoqiao] [~tomscut] [~zhangshuyan] [~haiyang Hu] Sir, could you please 
help me check this potential problem when you have free time? Thanks in advance.

> FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait 
> method
> ---
>
> Key: HDFS-17334
> URL: https://issues.apache.org/jira/browse/HDFS-17334
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
> Fix For: 3.5.0
>
>
> In method FSEditLogAsync#enqueueEdit, the below code exists:
> {code:java}
> if (Thread.holdsLock(this)) {
>           // if queue is full, synchronized caller must immediately relinquish
>           // the monitor before re-offering to avoid deadlock with sync thread
>           // which needs the monitor to write transactions.
>           int permits = overflowMutex.drainPermits();
>           try {
>             do {
>               this.wait(1000); // will be notified by next logSync.
>             } while (!editPendingQ.offer(edit));
>           } finally {
>             overflowMutex.release(permits);
>           }
>         }  {code}
> It may invoke this.wait(1000) without holding this object's monitor.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method

2024-01-09 Thread farmmamba (Jira)
farmmamba created HDFS-17334:


 Summary: FSEditLogAsync#enqueueEdit does not synchronized this 
before invoke wait method
 Key: HDFS-17334
 URL: https://issues.apache.org/jira/browse/HDFS-17334
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.3.6
Reporter: farmmamba
Assignee: farmmamba
 Fix For: 3.5.0


In method FSEditLogAsync#enqueueEdit, the below code exists:
{code:java}
if (Thread.holdsLock(this)) {
          // if queue is full, synchronized caller must immediately relinquish
          // the monitor before re-offering to avoid deadlock with sync thread
          // which needs the monitor to write transactions.
          int permits = overflowMutex.drainPermits();
          try {
            do {
              this.wait(1000); // will be notified by next logSync.
            } while (!editPendingQ.offer(edit));
          } finally {
            overflowMutex.release(permits);
          }
        }  {code}
It may invoke this.wait(1000) without holding this object's monitor.
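
A standalone demo of the rule in question: Object.wait() requires the caller 
to hold the object's monitor, otherwise it throws 
IllegalMonitorStateException. Note that in the snippet above the enclosing 
Thread.holdsLock(this) check means the branch only runs when the monitor is 
already held.
{code:java}
public class WaitMonitorDemo {
  public static void main(String[] args) throws InterruptedException {
    Object lock = new Object();
    synchronized (lock) {
      lock.wait(10); // fine: the monitor is held, times out after 10 ms
    }
    try {
      lock.wait(10); // monitor not held here
    } catch (IllegalMonitorStateException e) {
      System.out.println("wait() without the monitor: " + e);
    }
  }
}
{code}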

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17330) CachingGetSpaceUsed#used should use LongAdder after DN splits lock

2024-01-07 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba resolved HDFS-17330.
--
Resolution: Not A Problem

> CachingGetSpaceUsed#used should use LongAdder after DN splits lock
> --
>
> Key: HDFS-17330
> URL: https://issues.apache.org/jira/browse/HDFS-17330
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> After HDFS-15382, the DataNode uses fine-grained locking.
> So we had better use LongAdder instead of AtomicLong in class 
> CachingGetSpaceUsed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17330) CachingGetSpaceUsed#used should use LongAdder after DN splits lock

2024-01-07 Thread farmmamba (Jira)
farmmamba created HDFS-17330:


 Summary: CachingGetSpaceUsed#used should use LongAdder after DN 
splits lock
 Key: HDFS-17330
 URL: https://issues.apache.org/jira/browse/HDFS-17330
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.3.6
Reporter: farmmamba
Assignee: farmmamba


After HDFS-15382, the DataNode uses fine-grained locking.

So we had better use LongAdder instead of AtomicLong in class 
CachingGetSpaceUsed.
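
A sketch contrasting the two counters (names are illustrative, not the real 
CachingGetSpaceUsed fields): AtomicLong CASes on a single cell, while 
LongAdder stripes a hot counter across cells under contention.
{code:java}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

class UsedSpaceSketch {
  private final AtomicLong usedAtomic = new AtomicLong(); // current style
  private final LongAdder usedAdder = new LongAdder();    // proposed style

  void incDfsUsed(long delta) {
    usedAtomic.addAndGet(delta); // all threads contend on one cell
    usedAdder.add(delta);        // contention spread across cells
  }

  long getUsed() {
    return usedAdder.sum();      // sum() is a snapshot, fine for a cached value
  }
}
{code}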



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17329) DiskBalancerCluster#nodesToProcess should better use ArrayList

2024-01-06 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17329:
-
Description: 
Currently, nodesToProcess uses a LinkedList to store elements, but the 
computePlan method performs positional access like below:
{code:java}
nodesToProcess.get(x); {code}
Since LinkedList.get(x) walks the list node by node (O(n)) while 
ArrayList.get(x) is a direct index (O(1)), we had better change the LinkedList 
to an ArrayList here. A small demo follows.
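
The demo referenced above (standalone; timings are illustrative only):
{code:java}
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class GetCostDemo {
  public static void main(String[] args) {
    List<Integer> linked = new LinkedList<>();
    List<Integer> array = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) { linked.add(i); array.add(i); }

    long sink = 0;
    long t0 = System.nanoTime();
    for (int i = 0; i < linked.size(); i += 1000) sink += linked.get(i); // walks nodes
    long t1 = System.nanoTime();
    for (int i = 0; i < array.size(); i += 1000) sink += array.get(i);   // indexes directly
    long t2 = System.nanoTime();
    System.out.printf("LinkedList: %d us, ArrayList: %d us (sink=%d)%n",
        (t1 - t0) / 1000, (t2 - t1) / 1000, sink);
  }
}
{code}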

> DiskBalancerCluster#nodesToProcess should better use ArrayList
> --
>
> Key: HDFS-17329
> URL: https://issues.apache.org/jira/browse/HDFS-17329
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: diskbalancer
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Trivial
>
> Currently, nodesToProcess uses a LinkedList to store elements, but the 
> computePlan method performs positional access like below:
> {code:java}
> nodesToProcess.get(x); {code}
> Since LinkedList.get(x) walks the list node by node (O(n)) while 
> ArrayList.get(x) is a direct index (O(1)), we had better change the 
> LinkedList to an ArrayList here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17329) DiskBalancerCluster#nodesToProcess should better use ArrayList

2024-01-06 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17329:
-
  Component/s: diskbalancer
 Target Version/s: 3.5.0
Affects Version/s: 3.3.6
  Summary: DiskBalancerCluster#nodesToProcess should better use 
ArrayList  (was: 
org.apache.hadoop.hdfs.server.diskbalancer.datamodel.DiskBalancerCluster#nodesToProcess)

> DiskBalancerCluster#nodesToProcess should better use ArrayList
> --
>
> Key: HDFS-17329
> URL: https://issues.apache.org/jira/browse/HDFS-17329
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: diskbalancer
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17329) org.apache.hadoop.hdfs.server.diskbalancer.datamodel.DiskBalancerCluster#nodesToProcess

2024-01-06 Thread farmmamba (Jira)
farmmamba created HDFS-17329:


 Summary: 
org.apache.hadoop.hdfs.server.diskbalancer.datamodel.DiskBalancerCluster#nodesToProcess
 Key: HDFS-17329
 URL: https://issues.apache.org/jira/browse/HDFS-17329
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17322) RetryCache#MAX_CAPACITY seems to be MIN_CAPACITY

2024-01-03 Thread farmmamba (Jira)
farmmamba created HDFS-17322:


 Summary: RetryCache#MAX_CAPACITY seems to be MIN_CAPACITY
 Key: HDFS-17322
 URL: https://issues.apache.org/jira/browse/HDFS-17322
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ipc
Affects Versions: 3.3.6
Reporter: farmmamba
Assignee: farmmamba


From the code logic, we can infer that RetryCache#MAX_CAPACITY would be better 
named MIN_CAPACITY: the constant is used as a lower bound on the computed 
capacity, not an upper bound.
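
A hypothetical sketch of the naming issue (constant value and method shape 
assumed, not copied from the source): when a constant is applied as the floor 
of a computed capacity via Math.max, calling it MAX_CAPACITY is misleading.
{code:java}
// Assumed illustration, not the actual RetryCache source.
class CapacityNaming {
  // Used as a lower bound below, so MIN_CAPACITY is the honest name.
  private static final int MIN_CAPACITY = 16;

  static int computeCapacity(long maxMemory, double percentage) {
    int capacity = (int) (maxMemory * percentage / 100.0);
    // Math.max guarantees the result is never BELOW the constant:
    // it bounds capacity from below, not from above.
    return Math.max(capacity, MIN_CAPACITY);
  }
}
{code}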



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17318) RBF: MountTableResolver#locationCache supports multi policies

2024-01-02 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17318:


Assignee: farmmamba

> RBF: MountTableResolver#locationCache supports multi policies
> -
>
> Key: HDFS-17318
> URL: https://issues.apache.org/jira/browse/HDFS-17318
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17318) MountTableResolver#locationCache supports multi policies

2024-01-02 Thread farmmamba (Jira)
farmmamba created HDFS-17318:


 Summary: MountTableResolver#locationCache supports multi policies
 Key: HDFS-17318
 URL: https://issues.apache.org/jira/browse/HDFS-17318
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: rbf
Reporter: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17318) RBF: MountTableResolver#locationCache supports multi policies

2024-01-02 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17318:
-
Summary: RBF: MountTableResolver#locationCache supports multi policies  
(was: MountTableResolver#locationCache supports multi policies)

> RBF: MountTableResolver#locationCache supports multi policies
> -
>
> Key: HDFS-17318
> URL: https://issues.apache.org/jira/browse/HDFS-17318
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17314) Add a metrics to record congestion backoff counts.

2024-01-01 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17314:


Assignee: farmmamba

> Add a metrics to record congestion backoff counts.
> --
>
> Key: HDFS-17314
> URL: https://issues.apache.org/jira/browse/HDFS-17314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> When congestion backoff is enabled, it is useful to know how many times
> DataNodes have told clients to back off. This metric helps us better
> understand how the congestion-handling feature behaves.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17314) Add a metrics to record congestion backoff counts.

2024-01-01 Thread farmmamba (Jira)
farmmamba created HDFS-17314:


 Summary: Add a metrics to record congestion backoff counts.
 Key: HDFS-17314
 URL: https://issues.apache.org/jira/browse/HDFS-17314
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.3.6
Reporter: farmmamba


When congestion backoff is enabled, it is useful to know how many times 
DataNodes have told clients to back off. This metric helps us better 
understand how the congestion-handling feature behaves.
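
A minimal sketch of such a metric (names assumed; this is not the actual 
Hadoop metrics code): increment a counter whenever an ack from the pipeline 
signals congestion.
{code:java}
import java.util.concurrent.atomic.LongAdder;

// Assumed illustration of the proposed counter.
class CongestionBackoffMetrics {
  private final LongAdder backoffCount = new LongAdder();

  // Invoked on the ack-processing path when a DataNode reply
  // carries a congestion signal telling the client to back off.
  void onCongestionSignal() {
    backoffCount.increment();
  }

  long getBackoffCount() {
    return backoffCount.sum();
  }
}
{code}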



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17312) packetsReceived metric should ignore heartbeat packet

2023-12-30 Thread farmmamba (Jira)
farmmamba created HDFS-17312:


 Summary: packetsReceived metric should ignore heartbeat packet
 Key: HDFS-17312
 URL: https://issues.apache.org/jira/browse/HDFS-17312
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.3.6
Reporter: farmmamba


The packetsReceived metric should ignore heartbeat packets and only count data 
packets and the last packet in a block.
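
A simplified sketch of the intended counting rule (field names assumed): a 
heartbeat packet carries no data and is not the last packet in the block, so 
it should be skipped.
{code:java}
// Assumed illustration, not the actual BlockReceiver metrics code.
class PacketCounter {
  private long packetsReceived = 0;

  void onPacket(int dataLen, boolean lastPacketInBlock) {
    boolean heartbeat = dataLen == 0 && !lastPacketInBlock;
    if (!heartbeat) {
      packetsReceived++; // count data packets and the last packet only
    }
  }

  long getPacketsReceived() {
    return packetsReceived;
  }
}
{code}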



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17312) packetsReceived metric should ignore heartbeat packet

2023-12-30 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17312:


Assignee: farmmamba

> packetsReceived metric should ignore heartbeat packet
> -
>
> Key: HDFS-17312
> URL: https://issues.apache.org/jira/browse/HDFS-17312
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> The packetsReceived metric should ignore heartbeat packets and only count
> data packets and the last packet in a block.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17289) Considering the size of non-lastBlocks equals to complete block size can cause append failure.

2023-12-17 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17289:
-
Summary: Considering the size of non-lastBlocks equals to complete block 
size can cause append failure.  (was: consider the size of non-lastBlocks 
equals to complete block size can cause append failure.)

> Considering the size of non-lastBlocks equals to complete block size can 
> cause append failure.
> --
>
> Key: HDFS-17289
> URL: https://issues.apache.org/jira/browse/HDFS-17289
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17289) Should not consider the size of non-lastBlocks equals to complete block size.

2023-12-17 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17289:
-
Issue Type: Bug  (was: Improvement)

> Should not consider the size of non-lastBlocks equals to complete block size.
> -
>
> Key: HDFS-17289
> URL: https://issues.apache.org/jira/browse/HDFS-17289
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17289) consider the size of non-lastBlocks equals to complete block size can cause append failure.

2023-12-17 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17289:
-
Summary: consider the size of non-lastBlocks equals to complete block size 
can cause append failure.  (was: Should not consider the size of non-lastBlocks 
equals to complete block size.)

> consider the size of non-lastBlocks equals to complete block size can cause 
> append failure.
> ---
>
> Key: HDFS-17289
> URL: https://issues.apache.org/jira/browse/HDFS-17289
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.

2023-12-16 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17293:
-
Summary: First packet data + checksum size will be set to 516 bytes when 
writing to a new block.  (was: First packet size will be set to 516 bytes when 
writing to a new block.)

> First packet data + checksum size will be set to 516 bytes when writing to a 
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> First packet size will be set to 516 bytes when writing to a new block.
> In method computePacketChunkSize, the parameters psize and csize would be
> (0, 512) when writing to a new block. It would be better to use
> writePacketSize.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17293) First packet size will be set to 516 bytes when writing to a new block.

2023-12-16 Thread farmmamba (Jira)
farmmamba created HDFS-17293:


 Summary: First packet size will be set to 516 bytes when writing 
to a new block.
 Key: HDFS-17293
 URL: https://issues.apache.org/jira/browse/HDFS-17293
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.3.6
Reporter: farmmamba
Assignee: farmmamba


First packet size will be set to 516 bytes when writing to a new block.

In method computePacketChunkSize, the parameters psize and csize would be 
(0, 512) when writing to a new block. It would be better to use 
writePacketSize.
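
A worked sketch of the arithmetic (constants assumed: 512-byte checksum chunks 
with 4-byte CRC32 checksums): with psize = 0, the chunks-per-packet 
computation bottoms out at one chunk, so the first packet carries 
512 + 4 = 516 bytes instead of a full write-packet-size payload.
{code:java}
// Assumed illustration of the sizing logic, not DFSOutputStream itself.
class PacketSizingDemo {
  static int chunksPerPacket(int psize, int csize) {
    return Math.max(psize / csize, 1); // psize = 0 collapses to 1 chunk
  }

  public static void main(String[] args) {
    int csize = 512 + 4; // chunk data + CRC32 checksum = 516 bytes
    System.out.println(chunksPerPacket(0, csize) * csize);     // 516
    System.out.println(chunksPerPacket(65536, csize) * csize); // 65532
  }
}
{code}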



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17291) DataNode metric bytesWritten is not totally accurate in some situations.

2023-12-15 Thread farmmamba (Jira)
farmmamba created HDFS-17291:


 Summary: DataNode metric bytesWritten is not totally accurate in 
some situations.
 Key: HDFS-17291
 URL: https://issues.apache.org/jira/browse/HDFS-17291
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.3.6
Reporter: farmmamba


As the title describes, the DataNode metric bytesWritten is not fully accurate 
in some situations, such as failure recovery and data re-sends. We should fix 
it.
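
A hypothetical sketch of one way to avoid the double counting (names assumed): 
credit bytesWritten only with bytes that extend past what is already on disk, 
so data re-sent during recovery is not counted twice.
{code:java}
// Assumed illustration, not the actual DataNode metrics code.
class BytesWrittenMetric {
  private long bytesWritten = 0;

  void onPacketPersisted(long onDiskLen, long offsetInBlock, int dataLen) {
    long packetEnd = offsetInBlock + dataLen;
    // Only bytes beyond what is already on disk are genuinely new.
    long newBytes =
        Math.max(0, packetEnd - Math.max(onDiskLen, offsetInBlock));
    bytesWritten += newBytes;
  }

  long getBytesWritten() {
    return bytesWritten;
  }
}
{code}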



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17289) Should not consider the size of non-lastBlocks equals to complete block size.

2023-12-13 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17289:


Assignee: farmmamba

> Should not consider the size of non-lastBlocks equals to complete block size.
> -
>
> Key: HDFS-17289
> URL: https://issues.apache.org/jira/browse/HDFS-17289
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17289) Should not consider the size of non-lastBlocks equals to complete block size.

2023-12-13 Thread farmmamba (Jira)
farmmamba created HDFS-17289:


 Summary: Should not consider the size of non-lastBlocks equals to 
complete block size.
 Key: HDFS-17289
 URL: https://issues.apache.org/jira/browse/HDFS-17289
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.3.6
Reporter: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17288) Add a metric to record the number of non-lastBlocks which have non-complete blocksize

2023-12-13 Thread farmmamba (Jira)
farmmamba created HDFS-17288:


 Summary: Add a metric to record the number of non-lastBlocks which 
have non-complete blocksize
 Key: HDFS-17288
 URL: https://issues.apache.org/jira/browse/HDFS-17288
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: farmmamba


Add a metric to record the number of non-last blocks whose size is smaller 
than a complete block.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17283) Change the name of variable SECOND in HdfsClientConfigKeys

2023-12-09 Thread farmmamba (Jira)
farmmamba created HDFS-17283:


 Summary: Change the name of variable SECOND in HdfsClientConfigKeys
 Key: HDFS-17283
 URL: https://issues.apache.org/jira/browse/HDFS-17283
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client
Affects Versions: 3.3.6
Reporter: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17283) Change the name of variable SECOND in HdfsClientConfigKeys

2023-12-09 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17283:


Assignee: farmmamba

> Change the name of variable SECOND in HdfsClientConfigKeys
> --
>
> Key: HDFS-17283
> URL: https://issues.apache.org/jira/browse/HDFS-17283
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17280) Pipeline recovery should better end block in advance when bytes acked greater than half of blocksize.

2023-12-08 Thread farmmamba (Jira)
farmmamba created HDFS-17280:


 Summary: Pipeline recovery should better end block in advance when 
bytes acked greater than half of blocksize.
 Key: HDFS-17280
 URL: https://issues.apache.org/jira/browse/HDFS-17280
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client
Reporter: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17273) Change the way of computing some local variables duration for better debugging

2023-12-04 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17273:
-
Priority: Minor  (was: Major)

> Change the way of computing some local variables duration for better debugging
> --
>
> Key: HDFS-17273
> URL: https://issues.apache.org/jira/browse/HDFS-17273
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17273) Change the way of computing some local variables duration for better debugging

2023-12-04 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17273:


Assignee: farmmamba

> Change the way of computing some local variables duration for better debugging
> --
>
> Key: HDFS-17273
> URL: https://issues.apache.org/jira/browse/HDFS-17273
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17273) Change the way of computing some local variables duration for better debugging

2023-12-04 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17273:
-
Summary: Change the way of computing some local variables duration for 
better debugging  (was: Change some duration local field in DataStream)

> Change the way of computing some local variables duration for better debugging
> --
>
> Key: HDFS-17273
> URL: https://issues.apache.org/jira/browse/HDFS-17273
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17273) Change some duration local field in DataStream

2023-12-04 Thread farmmamba (Jira)
farmmamba created HDFS-17273:


 Summary: Change some duration local field in DataStream
 Key: HDFS-17273
 URL: https://issues.apache.org/jira/browse/HDFS-17273
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-17267) Client sends the same packet multiple times when method markSlowNode throws IOException.

2023-11-30 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791659#comment-17791659
 ] 

farmmamba edited comment on HDFS-17267 at 11/30/23 2:43 PM:


We can debug the unit test method testPipelineRecoveryWithSlowNode to verify 
this PR.

Set a breakpoint in DataStreamer#run at `LOG.debug("{} sending {}", this, one);`.

We can see that the DFSPacket with seq=3 is sent twice.

BTW, on the DataNode side the packet data will not be written twice: 
receivePacket() compares onDiskLen with offsetInBlock, and if 
onDiskLen >= offsetInBlock, the data is not written again.
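
A simplified sketch of that check (method and field names assumed, not the 
actual BlockReceiver source): a resent packet whose bytes are already 
persisted is acked without being written again.
{code:java}
// Assumed illustration of the dedup check in receivePacket().
class PacketDedup {
  /** @return true if the packet still contains bytes not yet on disk. */
  static boolean needsWrite(long onDiskLen, long offsetInBlock, int dataLen) {
    long packetEnd = offsetInBlock + dataLen;
    return packetEnd > onDiskLen; // only bytes past onDiskLen are new
  }

  public static void main(String[] args) {
    // Resent packet covering [0, 516) when 516 bytes are already on disk:
    System.out.println(needsWrite(516, 0, 516));   // false -> skip write
    // Packet covering [516, 1032) when 516 bytes are on disk:
    System.out.println(needsWrite(516, 516, 516)); // true  -> write
  }
}
{code}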


was (Author: zhanghaobo):
We can debug the unit test method testPipelineRecoveryWithSlowNode to verify 
this PR.

Set a breakpoint in DataStreamer#run at `LOG.debug("{} sending {}", this, one);`.

We can see that the DFSPacket with seq=3 is sent twice.

> Client sends the same packet multiple times when method markSlowNode throws 
> IOException.
> ---
>
> Key: HDFS-17267
> URL: https://issues.apache.org/jira/browse/HDFS-17267
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> Since HDFS-16348, we can kick a SLOW node out of the pipeline when writing 
> data.
> I think this introduced a problem: the same packet can be sent twice or more 
> times when we kick out the SLOW node.
>  
> The flow is as follows:
> 1. DFSPacket p1 is pushed into dataQueue.
> 2. DataStreamer takes DFSPacket p1 from dataQueue.
> 3. p1 is removed from dataQueue and pushed into ackQueue.
> 4. sendPacket(p1).
> 5. In ResponseProcessor#run, the pipeline ack for p1 is read.
> 6. A SLOW node is detected, so method markSlowNode throws IOException and 
> `ackQueue.removeFirst();` is not executed.
> 7. In the next loop of DataStreamer#run, method processDatanodeOrExternalError 
> executes `dataQueue.addAll(0, ackQueue);`.
> 8. p1 is sent again.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17267) Client sends the same packet multiple times when method markSlowNode throws IOException.

2023-11-30 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791659#comment-17791659
 ] 

farmmamba commented on HDFS-17267:
--

We can debug the unit test method testPipelineRecoveryWithSlowNode to verify 
this PR.

Set a breakpoint in DataStreamer#run at `LOG.debug("{} sending {}", this, one);`.

We can see that the DFSPacket with seq=3 is sent twice.

> Client sends the same packet multiple times when method markSlowNode throws 
> IOException.
> ---
>
> Key: HDFS-17267
> URL: https://issues.apache.org/jira/browse/HDFS-17267
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> Since HDFS-16348, we can kick a SLOW node out of the pipeline when writing 
> data.
> I think this introduced a problem: the same packet can be sent twice or more 
> times when we kick out the SLOW node.
>  
> The flow is as follows:
> 1. DFSPacket p1 is pushed into dataQueue.
> 2. DataStreamer takes DFSPacket p1 from dataQueue.
> 3. p1 is removed from dataQueue and pushed into ackQueue.
> 4. sendPacket(p1).
> 5. In ResponseProcessor#run, the pipeline ack for p1 is read.
> 6. A SLOW node is detected, so method markSlowNode throws IOException and 
> `ackQueue.removeFirst();` is not executed.
> 7. In the next loop of DataStreamer#run, method processDatanodeOrExternalError 
> executes `dataQueue.addAll(0, ackQueue);`.
> 8. p1 is sent again.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


