[jira] [Updated] (HDFS-15273) CacheReplicationMonitor hold lock for long time and lead to NN out of service

2022-03-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15273:
---
Attachment: HDFS-15273.002.patch

> CacheReplicationMonitor hold lock for long time and lead to NN out of service
> -
>
> Key: HDFS-15273
> URL: https://issues.apache.org/jira/browse/HDFS-15273
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching, namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15273.001.patch, HDFS-15273.002.patch
>
>
> CacheReplicationMonitor scans the cache directives and the cached block map 
> periodically. As more and more cache directives are added, 
> CacheReplicationMonitor takes a very long time to rescan all of the cache 
> directives and cached blocks. Meanwhile, the scan holds the global write 
> lock, so the NameNode cannot process other requests for the whole scan period.
> So I think we should warn end users who turn on the CacheManager 
> feature about this risk until the implementation is improved.
> {code:java}
>   private void rescan() throws InterruptedException {
> scannedDirectives = 0;
> scannedBlocks = 0;
> try {
>   namesystem.writeLock();
>   try {
> lock.lock();
> if (shutdown) {
>   throw new InterruptedException("CacheReplicationMonitor was " +
>   "shut down.");
> }
> curScanCount = completedScanCount + 1;
>   } finally {
> lock.unlock();
>   }
>   resetStatistics();
>   rescanCacheDirectives();
>   rescanCachedBlockMap();
>   blockManager.getDatanodeManager().resetLastCachingDirectiveSentTime();
> } finally {
>   namesystem.writeUnlock();
> }
>   }
> {code}
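
To get a feel for how much work a single rescan will do before enabling or growing the cache, the directive count can be read from the client side. A minimal sketch (illustrative only), assuming fs.defaultFS points at the HDFS cluster in question:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveEntry;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;

public class CacheDirectiveCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // An empty filter lists every cache directive on the cluster.
      RemoteIterator<CacheDirectiveEntry> it =
          dfs.listCacheDirectives(new CacheDirectiveInfo.Builder().build());
      long directives = 0;
      while (it.hasNext()) {
        it.next();
        directives++;
      }
      // Every rescan walks all directives (and all cached blocks) while holding
      // the namesystem write lock, so this count is a rough proxy for how long
      // the NameNode will be unresponsive per scan.
      System.out.println("cache directives: " + directives);
    }
  }
}
{code}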






[jira] [Commented] (HDFS-15273) CacheReplicationMonitor hold lock for long time and lead to NN out of service

2022-03-12 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505270#comment-17505270
 ] 

Xiaoqiao He commented on HDFS-15273:


Thanks [~weichiu] for the reminder. Updated the patch to include hdfs-default.xml.
Let's wait and see what Yetus says.

> CacheReplicationMonitor hold lock for long time and lead to NN out of service
> -
>
> Key: HDFS-15273
> URL: https://issues.apache.org/jira/browse/HDFS-15273
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching, namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15273.001.patch, HDFS-15273.002.patch
>
>
> CacheReplicationMonitor scans the cache directives and the cached block map 
> periodically. As more and more cache directives are added, 
> CacheReplicationMonitor takes a very long time to rescan all of the cache 
> directives and cached blocks. Meanwhile, the scan holds the global write 
> lock, so the NameNode cannot process other requests for the whole scan period.
> So I think we should warn end users who turn on the CacheManager 
> feature about this risk until the implementation is improved.
> {code:java}
>   private void rescan() throws InterruptedException {
> scannedDirectives = 0;
> scannedBlocks = 0;
> try {
>   namesystem.writeLock();
>   try {
> lock.lock();
> if (shutdown) {
>   throw new InterruptedException("CacheReplicationMonitor was " +
>   "shut down.");
> }
> curScanCount = completedScanCount + 1;
>   } finally {
> lock.unlock();
>   }
>   resetStatistics();
>   rescanCacheDirectives();
>   rescanCachedBlockMap();
>   blockManager.getDatanodeManager().resetLastCachingDirectiveSentTime();
> } finally {
>   namesystem.writeUnlock();
> }
>   }
> {code}






[jira] [Updated] (HDFS-16428) Source path with storagePolicy cause wrong typeConsumed while rename

2022-03-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16428:
---
Fix Version/s: 3.2.3
   3.3.3

Thanks [~prasad-acit] for the reminder. Cherry-picked to branch-3.2 and 
branch-3.3.

> Source path with storagePolicy cause wrong typeConsumed while rename
> 
>
> Key: HDFS-16428
> URL: https://issues.apache.org/jira/browse/HDFS-16428
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.3
>
> Attachments: example.txt
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> When computing quota in a rename operation, we use the storage policy of the target 
> directory to compute the source quota usage. This produces a wrong value of 
> typeConsumed when the source path has its own storage policy set. I provided a unit 
> test to demonstrate this situation.
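
A sketch of the scenario (paths, sizes, and policy names are illustrative; the calls are the standard DistributedFileSystem policy/quota APIs): give the source subtree its own policy, put a storage-type quota on the target tree, rename, and compare typeConsumed with the storage types the moved blocks actually occupy.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.QuotaUsage;
import org.apache.hadoop.fs.StorageType;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RenameTypeConsumedCheck {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    Path src = new Path("/warm/data");   // source subtree with its own policy
    Path dstDir = new Path("/ssd");      // target parent with a different policy

    dfs.setStoragePolicy(src, "HOT");          // source blocks stay on DISK
    dfs.setStoragePolicy(dstDir, "ALL_SSD");   // target policy says SSD
    dfs.setQuotaByStorageType(dstDir, StorageType.SSD, 10L * 1024 * 1024 * 1024);

    dfs.rename(src, new Path(dstDir, "data"));

    // The typeConsumed charged to the target quota should reflect the storage
    // types the moved blocks actually occupy (DISK here), not the types implied
    // by the target directory's policy.
    QuotaUsage usage = dfs.getQuotaUsage(dstDir);
    System.out.println("SSD consumed:  " + usage.getTypeConsumed(StorageType.SSD));
    System.out.println("DISK consumed: " + usage.getTypeConsumed(StorageType.DISK));
  }
}
{code}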






[jira] [Commented] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock

2022-02-20 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495328#comment-17495328
 ] 

Xiaoqiao He commented on HDFS-15382:


Thanks [~yuanbo] for your feedback. [~Aiphag0] is working on pushing this 
feature forward now. The latest PR is 
https://github.com/apache/hadoop/pull/3941. Before that we merged 
https://issues.apache.org/jira/browse/HDFS-16429. Any suggestions, or working 
together here, are welcome.
{quote}It seems not a  compatible feature{quote}
We did not create a new feature branch for this improvement. IMO there are no 
incompatible changes for end users so far.

> Split FsDatasetImpl from blockpool lock to blockpool volume lock 
> -
>
> Key: HDFS-15382
> URL: https://issues.apache.org/jira/browse/HDFS-15382
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, 
> image-2020-06-03-1.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In HDFS-15180 we split the lock to block pool granularity. But when one volume is 
> under heavy load, it still blocks other requests in the same block pool that target a 
> different volume. So we split the lock into two levels to avoid this and to 
> improve DataNode performance.
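
A minimal sketch of the two-level idea (illustrative only, not the actual FsDatasetImpl code): a per-block-pool read/write lock plus per-volume locks, so heavy I/O on one volume no longer serializes requests that target other volumes in the same block pool.
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative two-level lock registry: block pool -> volume -> lock.
class TwoLevelLocks {
  private final ConcurrentMap<String, ReentrantReadWriteLock> poolLocks =
      new ConcurrentHashMap<>();
  private final ConcurrentMap<String, ReentrantReadWriteLock> volumeLocks =
      new ConcurrentHashMap<>();

  private ReentrantReadWriteLock poolLock(String bpid) {
    return poolLocks.computeIfAbsent(bpid, k -> new ReentrantReadWriteLock());
  }

  private ReentrantReadWriteLock volumeLock(String bpid, String volume) {
    return volumeLocks.computeIfAbsent(bpid + "/" + volume,
        k -> new ReentrantReadWriteLock());
  }

  // Per-replica work: share the pool lock, exclusively lock only one volume.
  <T> T withVolumeWriteLock(String bpid, String volume,
      java.util.function.Supplier<T> op) {
    poolLock(bpid).readLock().lock();
    try {
      ReentrantReadWriteLock vol = volumeLock(bpid, volume);
      vol.writeLock().lock();
      try {
        return op.get();
      } finally {
        vol.writeLock().unlock();
      }
    } finally {
      poolLock(bpid).readLock().unlock();
    }
  }
}
{code}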






[jira] [Commented] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format

2022-03-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510993#comment-17510993
 ] 

Xiaoqiao He commented on HDFS-15987:


Reverted and re-checked in to correct the git commit message.

> Improve oiv tool to parse fsimage file in parallel with delimited format
> 
>
> Key: HDFS-15987
> URL: https://issues.apache.org/jira/browse/HDFS-15987
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: Improve_oiv_tool_001.pdf
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> The purpose of this Jira is to improve the oiv tool to parse an fsimage file with 
> sub-sections (see -HDFS-14617-) in parallel with the delimited format. 
> 1. Serial parsing is time-consuming
> The time to serially parse a large fsimage with delimited format (e.g. `hdfs 
> oiv -p Delimited -t  ...`) is as follows: 
> {code:java}
> 1) Loading string table: -> Not time consuming.
> 2) Loading inode references: -> Not time consuming
> 3) Loading directories in INode section: -> Slightly time consuming (3%)
> 4) Loading INode directory section:  -> A bit time consuming (11%)
> 5) Output:   -> Very time consuming (86%){code}
> Therefore, the output stage benefits the most from parallelization.
> 2. How to output in parallel
> The sub-sections are grouped in order, and each thread processes a group and 
> outputs it to the file corresponding to each thread, and finally merges the 
> output files.
> 3. The result of a test
> {code:java}
>  input fsimage file info:
>  3.4G, 12 sub-sections, 55976500 INodes
>  -
>  Threads TotalTime OutputTime MergeTime
>  1   18m37s 16m18s  –
>  4   8m7s   4m49s  41s{code}
>  
>  
>  
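
The output stage described above can be sketched as follows (illustrative only, not the oiv code itself): each worker renders its ordered group of sub-sections into its own temporary file, and the parts are then concatenated in group order so the merged file matches the serial output.
{code:java}
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDelimitedOutput {
  public static void main(String[] args) throws Exception {
    int threads = 4;
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Path> parts = new ArrayList<>();
    List<Future<?>> futures = new ArrayList<>();

    for (int i = 0; i < threads; i++) {
      final int group = i;
      Path part = Paths.get("out.part" + group);
      parts.add(part);
      futures.add(pool.submit(() -> {
        // Placeholder for "render group i of sub-sections to delimited text".
        List<String> rows = List.of("row-from-group-" + group);
        Files.write(part, rows);
        return null;
      }));
    }
    for (Future<?> f : futures) {
      f.get();               // surface worker failures
    }
    pool.shutdown();

    // Merge in group order so the final file matches the serial ordering.
    try (OutputStream out = Files.newOutputStream(Paths.get("out.txt"),
        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
      for (Path part : parts) {
        Files.copy(part, out);
        Files.delete(part);
      }
    }
  }
}
{code}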






[jira] [Resolved] (HDFS-16504) Add parameter for NameNode to process getBlocks request

2022-03-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16504.

Resolution: Fixed

> Add parameter for NameNode to process getBlocks request
> --
>
> Key: HDFS-16504
> URL: https://issues.apache.org/jira/browse/HDFS-16504
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> HDFS-13183 added a nice feature that lets the Standby NameNode process getBlocks 
> requests to reduce the Active NameNode's load. The NameNode must set `dfs.ha.allow.stale.reads 
> = true` to enable this feature. However, if we set `dfs.ha.allow.stale.reads 
> = true`, the Standby NameNode will be able to process all read requests, which 
> may lead to YARN job failures because the Standby NameNode is stale.
> Maybe we should add a config `dfs.namenode.get-blocks.check.operation=false` 
> for the NameNode to disable the operation check when it processes getBlocks 
> requests.
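
A sketch of the proposed knob (the key name comes from this description; the default of true, which keeps today's behaviour, is an assumption): the NameNode would read the flag once and skip the operation-category check only for getBlocks when the operator opts out.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class GetBlocksCheckGate {
  // Key name as proposed in this issue.
  static final String KEY = "dfs.namenode.get-blocks.check.operation";

  static boolean shouldCheckOperation(Configuration conf) {
    // true keeps the current behaviour: getBlocks on a standby NameNode is
    // rejected unless dfs.ha.allow.stale.reads is enabled.
    return conf.getBoolean(KEY, true);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean(KEY, false);   // operator opts out for Balancer getBlocks
    System.out.println("check operation before getBlocks? "
        + shouldCheckOperation(conf));
  }
}
{code}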






[jira] [Updated] (HDFS-16504) Add parameter for NameNode to process getBlocks request

2022-03-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16504:
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
  Summary: Add parameter for NameNode to process getBlocks request  
(was: add `dfs.namenode.get-blocks.check.operation` to enable or disable 
checkOperation when NNs process getBlocks)

Committed to trunk. Thanks [~max2049] for your contributions!

> Add parameter for NameNode to process getBlocks request
> --
>
> Key: HDFS-16504
> URL: https://issues.apache.org/jira/browse/HDFS-16504
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> HDFS-13183 added a nice feature that lets the Standby NameNode process getBlocks 
> requests to reduce the Active NameNode's load. The NameNode must set `dfs.ha.allow.stale.reads 
> = true` to enable this feature. However, if we set `dfs.ha.allow.stale.reads 
> = true`, the Standby NameNode will be able to process all read requests, which 
> may lead to YARN job failures because the Standby NameNode is stale.
> Maybe we should add a config `dfs.namenode.get-blocks.check.operation=false` 
> for the NameNode to disable the operation check when it processes getBlocks 
> requests.






[jira] [Commented] (HDFS-15273) CacheReplicationMonitor hold lock for long time and lead to NN out of service

2022-03-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511029#comment-17511029
 ] 

Xiaoqiao He commented on HDFS-15273:


Submitted the v003 patch and triggered Yetus again.
[~weichiu] Would you mind taking another look? Thanks.

> CacheReplicationMonitor hold lock for long time and lead to NN out of service
> -
>
> Key: HDFS-15273
> URL: https://issues.apache.org/jira/browse/HDFS-15273
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching, namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15273.001.patch, HDFS-15273.002.patch, 
> HDFS-15273.003.patch
>
>
> CacheReplicationMonitor scans the cache directives and the cached block map 
> periodically. As more and more cache directives are added, 
> CacheReplicationMonitor takes a very long time to rescan all of the cache 
> directives and cached blocks. Meanwhile, the scan holds the global write 
> lock, so the NameNode cannot process other requests for the whole scan period.
> So I think we should warn end users who turn on the CacheManager 
> feature about this risk until the implementation is improved.
> {code:java}
>   private void rescan() throws InterruptedException {
> scannedDirectives = 0;
> scannedBlocks = 0;
> try {
>   namesystem.writeLock();
>   try {
> lock.lock();
> if (shutdown) {
>   throw new InterruptedException("CacheReplicationMonitor was " +
>   "shut down.");
> }
> curScanCount = completedScanCount + 1;
>   } finally {
> lock.unlock();
>   }
>   resetStatistics();
>   rescanCacheDirectives();
>   rescanCachedBlockMap();
>   blockManager.getDatanodeManager().resetLastCachingDirectiveSentTime();
> } finally {
>   namesystem.writeUnlock();
> }
>   }
> {code}






[jira] [Updated] (HDFS-15273) CacheReplicationMonitor hold lock for long time and lead to NN out of service

2022-03-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15273:
---
Attachment: HDFS-15273.003.patch

> CacheReplicationMonitor hold lock for long time and lead to NN out of service
> -
>
> Key: HDFS-15273
> URL: https://issues.apache.org/jira/browse/HDFS-15273
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching, namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15273.001.patch, HDFS-15273.002.patch, 
> HDFS-15273.003.patch
>
>
> CacheReplicationMonitor scans the cache directives and the cached block map 
> periodically. As more and more cache directives are added, 
> CacheReplicationMonitor takes a very long time to rescan all of the cache 
> directives and cached blocks. Meanwhile, the scan holds the global write 
> lock, so the NameNode cannot process other requests for the whole scan period.
> So I think we should warn end users who turn on the CacheManager 
> feature about this risk until the implementation is improved.
> {code:java}
>   private void rescan() throws InterruptedException {
> scannedDirectives = 0;
> scannedBlocks = 0;
> try {
>   namesystem.writeLock();
>   try {
> lock.lock();
> if (shutdown) {
>   throw new InterruptedException("CacheReplicationMonitor was " +
>   "shut down.");
> }
> curScanCount = completedScanCount + 1;
>   } finally {
> lock.unlock();
>   }
>   resetStatistics();
>   rescanCacheDirectives();
>   rescanCachedBlockMap();
>   blockManager.getDatanodeManager().resetLastCachingDirectiveSentTime();
> } finally {
>   namesystem.writeUnlock();
> }
>   }
> {code}






[jira] [Resolved] (HDFS-16498) Fix NPE for checkBlockReportLease

2022-03-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16498.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks @tomscut for your contribution!

> Fix NPE for checkBlockReportLease
> -
>
> Key: HDFS-16498
> URL: https://issues.apache.org/jira/browse/HDFS-16498
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2022-03-09-20-35-22-028.png, screenshot-1.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> During a NameNode restart, a DataNode that has not yet registered can still 
> trigger a full block report (FBR), which causes an NPE.
> !image-2022-03-09-20-35-22-028.png|width=871,height=158!
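
A minimal illustration of the kind of guard that avoids the NPE (hypothetical simplified types, not the committed patch): resolve the reporting DataNode first and reject the report cleanly when the node is not registered, instead of dereferencing a missing descriptor.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FbrLeaseGuard {
  static class DatanodeDescriptor {
    final String uuid;
    DatanodeDescriptor(String uuid) { this.uuid = uuid; }
  }

  static class UnregisteredNodeException extends Exception {
    UnregisteredNodeException(String uuid) {
      super("Unregistered DataNode " + uuid);
    }
  }

  private final Map<String, DatanodeDescriptor> registered = new ConcurrentHashMap<>();

  boolean checkBlockReportLease(String datanodeUuid, long leaseId)
      throws UnregisteredNodeException {
    DatanodeDescriptor node = registered.get(datanodeUuid);
    if (node == null) {
      // The DataNode sent an FBR while the NameNode does not (yet) know it;
      // tell it to register again instead of hitting an NPE.
      throw new UnregisteredNodeException(datanodeUuid);
    }
    return leaseId != 0;   // placeholder for the real lease validation
  }
}
{code}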






[jira] [Updated] (HDFS-16511) Improve lock type for ReplicaMap under fine-grain lock mode.

2022-03-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16511:
---
Summary: Improve lock type for ReplicaMap under fine-grain lock mode.  
(was: Change some frequent method lock type in ReplicaMap.)

> Improve lock type for ReplicaMap under fine-grain lock mode.
> 
>
> Key: HDFS-16511
> URL: https://issues.apache.org/jira/browse/HDFS-16511
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> In HDFS-16429 we made LightWeightResizableGSet thread safe, and in 
> HDFS-15382 we split the lock into block-pool-grained locks. After these 
> improvements, we can change some methods to acquire the read lock instead of 
> the write lock.
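
The kind of change described here can be pictured with a small wrapper (illustrative only, not ReplicaMap itself): once the underlying structure is safe for concurrent readers, read-only lookups can move from the exclusive write lock to the shared read lock, while mutations keep the write lock.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReplicaIndex<K, V> {
  private final Map<K, V> map = new HashMap<>();
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  V get(K key) {                    // frequent, read-only -> shared read lock
    lock.readLock().lock();
    try {
      return map.get(key);
    } finally {
      lock.readLock().unlock();
    }
  }

  V add(K key, V value) {           // mutation -> exclusive write lock
    lock.writeLock().lock();
    try {
      return map.put(key, value);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}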






[jira] [Updated] (HDFS-16511) Improve lock type for ReplicaMap under fine-grain lock mode.

2022-03-31 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16511:
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks @MingXiangLi  for your contributions!

> Improve lock type for ReplicaMap under fine-grain lock mode.
> 
>
> Key: HDFS-16511
> URL: https://issues.apache.org/jira/browse/HDFS-16511
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> In HDFS-16429 we made LightWeightResizableGSet thread safe, and in 
> HDFS-15382 we split the lock into block-pool-grained locks. After these 
> improvements, we can change some methods to acquire the read lock instead of 
> the write lock.






[jira] [Resolved] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format

2022-03-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-15987.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~wanghongbing] for your contributions! 

> Improve oiv tool to parse fsimage file in parallel with delimited format
> 
>
> Key: HDFS-15987
> URL: https://issues.apache.org/jira/browse/HDFS-15987
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: Improve_oiv_tool_001.pdf
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> The purpose of this Jira is to improve the oiv tool to parse an fsimage file with 
> sub-sections (see -HDFS-14617-) in parallel with the delimited format. 
> 1. Serial parsing is time-consuming
> The time to serially parse a large fsimage with delimited format (e.g. `hdfs 
> oiv -p Delimited -t  ...`) is as follows: 
> {code:java}
> 1) Loading string table: -> Not time consuming.
> 2) Loading inode references: -> Not time consuming
> 3) Loading directories in INode section: -> Slightly time consuming (3%)
> 4) Loading INode directory section:  -> A bit time consuming (11%)
> 5) Output:   -> Very time consuming (86%){code}
> Therefore, the output stage benefits the most from parallelization.
> 2. How to output in parallel
> The sub-sections are grouped in order, and each thread processes a group and 
> outputs it to the file corresponding to each thread, and finally merges the 
> output files.
> 3. The result of a test
> {code:java}
>  input fsimage file info:
>  3.4G, 12 sub-sections, 55976500 INodes
>  -
>  Threads TotalTime OutputTime MergeTime
>  1   18m37s 16m18s  –
>  4   8m7s   4m49s  41s{code}
>  
>  
>  






[jira] [Resolved] (HDFS-16429) Add DataSetLockManager to manage fine-grain locks for FsDataSetImpl

2022-01-27 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16429.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~Aiphag0] for your contributions!

> Add DataSetLockManager to manage fine-grain locks for FsDataSetImpl
> ---
>
> Key: HDFS-16429
> URL: https://issues.apache.org/jira/browse/HDFS-16429
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> 1. Use a lock manager to maintain a two-level lock for FsDataSetImpl.
> The simple lock model looks like this; parts of it are implemented as follows:
>  * For finalizeReplica(), append(), createRbw(): first take the BlockPoolLock 
> read lock, then take the BlockPoolLock-volume-lock write lock.
>  * For getStoredBlock(), getMetaDataInputStream(): first take the 
> BlockPoolLock read lock, then take the BlockPoolLock-volume-lock read lock.
>  * For deepCopyReplica(), getBlockReports(): take the BlockPoolLock read lock.
>  * For delete: hold the BlockPoolLock write lock.
> 2. Make LightWeightResizableGSet thread safe. It does not become a performance 
> bottleneck when made thread safe, and a thread-safe LightWeightResizableGSet lets 
> us reduce the lock granularity for ReplicaMap.
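
A sketch of that acquisition order using try-with-resources, so both levels are always released in reverse order (org.apache.hadoop.util.AutoCloseableLock is the hadoop-common helper; the lock fields below are stand-ins for what a DataSetLockManager would hand out):
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.apache.hadoop.util.AutoCloseableLock;

class BlockPoolVolumeLocks {
  private final ReentrantReadWriteLock poolLock = new ReentrantReadWriteLock();
  private final ReentrantReadWriteLock volumeLock = new ReentrantReadWriteLock();

  // e.g. finalizeReplica()/append()/createRbw(): pool read lock, volume write lock.
  void runUnderVolumeWriteLock(Runnable op) {
    try (AutoCloseableLock bp = new AutoCloseableLock(poolLock.readLock()).acquire();
         AutoCloseableLock vol = new AutoCloseableLock(volumeLock.writeLock()).acquire()) {
      op.run();
    }
  }

  // e.g. getStoredBlock()/getMetaDataInputStream(): both levels shared.
  void runUnderVolumeReadLock(Runnable op) {
    try (AutoCloseableLock bp = new AutoCloseableLock(poolLock.readLock()).acquire();
         AutoCloseableLock vol = new AutoCloseableLock(volumeLock.readLock()).acquire()) {
      op.run();
    }
  }
}
{code}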






[jira] [Updated] (HDFS-16429) Add DataSetLockManager to manage fine-grain locks for FsDataSetImpl

2022-01-27 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16429:
---
Summary: Add DataSetLockManager to manage fine-grain locks for 
FsDataSetImpl  (was: Add DataSetLockManager to maintain locks for FsDataSetImpl)

> Add DataSetLockManager to manage fine-grain locks for FsDataSetImpl
> ---
>
> Key: HDFS-16429
> URL: https://issues.apache.org/jira/browse/HDFS-16429
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> 1. Use a lock manager to maintain a two-level lock for FsDataSetImpl.
> The simple lock model looks like this; parts of it are implemented as follows:
>  * For finalizeReplica(), append(), createRbw(): first take the BlockPoolLock 
> read lock, then take the BlockPoolLock-volume-lock write lock.
>  * For getStoredBlock(), getMetaDataInputStream(): first take the 
> BlockPoolLock read lock, then take the BlockPoolLock-volume-lock read lock.
>  * For deepCopyReplica(), getBlockReports(): take the BlockPoolLock read lock.
>  * For delete: hold the BlockPoolLock write lock.
> 2. Make LightWeightResizableGSet thread safe. It does not become a performance 
> bottleneck when made thread safe, and a thread-safe LightWeightResizableGSet lets 
> us reduce the lock granularity for ReplicaMap.






[jira] [Updated] (HDFS-16402) Improve HeartbeatManager logic to avoid incorrect stats

2022-01-23 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16402:
---
Summary: Improve HeartbeatManager logic to avoid incorrect stats  (was: 
HeartbeatManager may cause incorrect stats)

> Improve HeartbeatManager logic to avoid incorrect stats
> ---
>
> Key: HDFS-16402
> URL: https://issues.apache.org/jira/browse/HDFS-16402
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-12-29-08-25-44-303.png, 
> image-2021-12-29-08-25-54-441.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> After reconfiguring {*}dfs.datanode.data.dir{*}, we found that the stats on the 
> NameNode web UI became *negative* and there were many NPEs in the NameNode logs. This 
> problem was solved by HDFS-14042.
> !image-2021-12-29-08-25-54-441.png|width=681,height=293!
> !image-2021-12-29-08-25-44-303.png|width=677,height=180!
> However, if *HeartbeatManager#updateHeartbeat* or 
> *HeartbeatManager#updateLifeline* throws another exception, the stats can 
> still become wrong. We should ensure that *stats.subtract()* and *stats.add()* are 
> transactional.
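
The transactional pairing can be illustrated with a tiny guard (hypothetical Stats/Node types, not the HeartbeatManager code): whatever the heartbeat update throws, the node's contribution is added back in a finally block, so subtract() and add() always pair up.
{code:java}
class HeartbeatStatsGuard {
  interface Stats { void subtract(Object node); void add(Object node); }
  interface Node  { void updateHeartbeat(long now); }

  private final Stats stats;

  HeartbeatStatsGuard(Stats stats) { this.stats = stats; }

  synchronized void updateHeartbeat(Node node, long now) {
    stats.subtract(node);
    try {
      node.updateHeartbeat(now);   // may throw
    } finally {
      stats.add(node);             // keep subtract()/add() transactional
    }
  }
}
{code}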






[jira] [Resolved] (HDFS-16402) Improve HeartbeatManager logic to avoid incorrect stats

2022-01-23 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16402.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~tomscut] for your reports and contributions!

> Improve HeartbeatManager logic to avoid incorrect stats
> ---
>
> Key: HDFS-16402
> URL: https://issues.apache.org/jira/browse/HDFS-16402
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2021-12-29-08-25-44-303.png, 
> image-2021-12-29-08-25-54-441.png
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> After reconfiguring {*}dfs.datanode.data.dir{*}, we found that the stats on the 
> NameNode web UI became *negative* and there were many NPEs in the NameNode logs. This 
> problem was solved by HDFS-14042.
> !image-2021-12-29-08-25-54-441.png|width=681,height=293!
> !image-2021-12-29-08-25-44-303.png|width=677,height=180!
> However, if *HeartbeatManager#updateHeartbeat* or 
> *HeartbeatManager#updateLifeline* throws another exception, the stats can 
> still become wrong. We should ensure that *stats.subtract()* and *stats.add()* are 
> transactional.






[jira] [Updated] (HDFS-16428) Source path with storagePolicy cause wrong typeConsumed while rename

2022-01-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16428:
---
Summary: Source path with storagePolicy cause wrong typeConsumed while 
rename  (was: Source path setted storagePolicy will cause wrong typeConsumed  
in rename operation)

> Source path with storagePolicy cause wrong typeConsumed while rename
> 
>
> Key: HDFS-16428
> URL: https://issues.apache.org/jira/browse/HDFS-16428
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Attachments: example.txt
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> When computing quota in a rename operation, we use the storage policy of the target 
> directory to compute the source quota usage. This produces a wrong value of 
> typeConsumed when the source path has its own storage policy set. I provided a unit 
> test to demonstrate this situation.






[jira] [Resolved] (HDFS-16428) Source path with storagePolicy cause wrong typeConsumed while rename

2022-01-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16428.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~lei w] for your contributions!

> Source path with storagePolicy cause wrong typeConsumed while rename
> 
>
> Key: HDFS-16428
> URL: https://issues.apache.org/jira/browse/HDFS-16428
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: example.txt
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When computing quota in a rename operation, we use the storage policy of the target 
> directory to compute the source quota usage. This produces a wrong value of 
> typeConsumed when the source path has its own storage policy set. I provided a unit 
> test to demonstrate this situation.






[jira] [Updated] (HDFS-16429) Add DataSetLockManager to maintain locks for FsDataSetImpl

2022-01-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16429:
---
Fix Version/s: (was: 3.2.0)
 Target Version/s:   (was: 3.2.0)
Affects Version/s: (was: 3.2.0)

> Add DataSetLockManager to maintain locks for FsDataSetImpl
> --
>
> Key: HDFS-16429
> URL: https://issues.apache.org/jira/browse/HDFS-16429
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> 1. Use a lock manager to maintain a two-level lock for FsDataSetImpl.
> The simple lock model looks like this; parts of it are implemented as follows:
>  * For finalizeReplica(), append(), createRbw(): first take the BlockPoolLock 
> read lock, then take the BlockPoolLock-volume-lock write lock.
>  * For getStoredBlock(), getMetaDataInputStream(): first take the 
> BlockPoolLock read lock, then take the BlockPoolLock-volume-lock read lock.
>  * For deepCopyReplica(), getBlockReports(): take the BlockPoolLock read lock.
>  * For delete: hold the BlockPoolLock write lock.
> 2. Make LightWeightResizableGSet thread safe. It does not become a performance 
> bottleneck when made thread safe, and a thread-safe LightWeightResizableGSet lets 
> us reduce the lock granularity for ReplicaMap.






[jira] [Assigned] (HDFS-16516) fix filesystemshell wrong params

2022-04-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-16516:
--

Assignee: guophilipse

> fix filesystemshell wrong params
> 
>
> Key: HDFS-16516
> URL: https://issues.apache.org/jira/browse/HDFS-16516
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.3.2
>Reporter: guophilipse
>Assignee: guophilipse
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Fix wrong param name in FileSystemShell






[jira] [Resolved] (HDFS-16516) fix filesystemshell wrong params

2022-04-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16516.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~philipse] for your contributions.

> fix filesystemshell wrong params
> 
>
> Key: HDFS-16516
> URL: https://issues.apache.org/jira/browse/HDFS-16516
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.3.2
>Reporter: guophilipse
>Assignee: guophilipse
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Fix wrong param name in FileSystemShell






[jira] [Commented] (HDFS-17175) setOwner maybe set any user due to HDFS-16798

2023-09-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761615#comment-17761615
 ] 

Xiaoqiao He commented on HDFS-17175:


Great, let's try to get this bugfix checked in.

> setOwner maybe set any user due to HDFS-16798
> -
>
> Key: HDFS-17175
> URL: https://issues.apache.org/jira/browse/HDFS-17175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: TangLin
>Priority: Major
>
> This MR may cause the mapping between t2i and i2t to become inconsistent. We 
> need to revert it.






[jira] [Resolved] (HDFS-16933) A race in SerialNumberMap will cause wrong owner, group and XATTR

2023-09-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16933.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> A race in SerialNumberMap will cause wrong owner, group and XATTR
> -
>
> Key: HDFS-16933
> URL: https://issues.apache.org/jira/browse/HDFS-16933
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> If the namenode enables parallel fsimage loading, a race that occurs in 
> SerialNumberMap will cause wrong ownership for INodes.
> {code:java}
> public int get(T t) {
>   if (t == null) {
> return 0;
>   }
>   Integer sn = t2i.get(t);
>   if (sn == null) {
> // Assume there are two thread with different t, such as:
> // T1 with hbase
> // T2 with hdfs
> // If T1 and T2 get the sn in the same time, they will get the same sn, 
> such as 10
> sn = current.getAndIncrement();
> if (sn > max) {
>   current.getAndDecrement();
>   throw new IllegalStateException(name + ": serial number map is full");
> }
> Integer old = t2i.putIfAbsent(t, sn);
> if (old != null) {
>   current.getAndDecrement();
>   return old;
> }
> // If T1 puts 10->hbase into i2t first, T2 will use 10 -> hdfs to 
> overwrite it, so the INodes will end up with the wrong owner hdfs 
> when it should actually be hbase.
> i2t.put(sn, t);
>   }
>   return sn;
> } {code}
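
One illustrative way to close the window (a sketch under simplified assumptions, not necessarily the committed fix) is to make the allocation and both map updates one atomic step, publishing the reverse mapping before the forward one, so no reader can observe a serial number whose i2t entry is missing or overwritten. The overflow check is omitted for brevity.
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

class SerialNumberMapSketch<T> {
  private final AtomicInteger current = new AtomicInteger(1);
  private final ConcurrentMap<T, Integer> t2i = new ConcurrentHashMap<>();
  private final ConcurrentMap<Integer, T> i2t = new ConcurrentHashMap<>();

  int get(T t) {
    if (t == null) {
      return 0;
    }
    Integer sn = t2i.get(t);       // fast path, no lock
    if (sn == null) {
      synchronized (this) {        // slow path: allocate and publish atomically
        sn = t2i.get(t);
        if (sn == null) {
          sn = current.getAndIncrement();
          i2t.put(sn, t);          // reverse mapping first ...
          t2i.put(t, sn);          // ... then make it reachable via t2i
        }
      }
    }
    return sn;
  }
}
{code}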






[jira] [Updated] (HDFS-16933) A race in SerialNumberMap will cause wrong owner, group and XATTR

2023-09-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16933:
---
Summary: A race in SerialNumberMap will cause wrong owner, group and XATTR  
(was: A race in SerialNumberMap will cause wrong ownership)

> A race in SerialNumberMap will cause wrong owner, group and XATTR
> -
>
> Key: HDFS-16933
> URL: https://issues.apache.org/jira/browse/HDFS-16933
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> If the namenode enables parallel fsimage loading, a race that occurs in 
> SerialNumberMap will cause wrong ownership for INodes.
> {code:java}
> public int get(T t) {
>   if (t == null) {
> return 0;
>   }
>   Integer sn = t2i.get(t);
>   if (sn == null) {
> // Assume there are two thread with different t, such as:
> // T1 with hbase
> // T2 with hdfs
> // If T1 and T2 get the sn in the same time, they will get the same sn, 
> such as 10
> sn = current.getAndIncrement();
> if (sn > max) {
>   current.getAndDecrement();
>   throw new IllegalStateException(name + ": serial number map is full");
> }
> Integer old = t2i.putIfAbsent(t, sn);
> if (old != null) {
>   current.getAndDecrement();
>   return old;
> }
> // If T1 puts 10->hbase into i2t first, T2 will use 10 -> hdfs to 
> overwrite it, so the INodes will end up with the wrong owner hdfs 
> when it should actually be hbase.
> i2t.put(sn, t);
>   }
>   return sn;
> } {code}






[jira] [Resolved] (HDFS-17140) Revisit the BPOfferService.reportBadBlocks() method.

2023-09-06 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17140.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Revisit the BPOfferService.reportBadBlocks() method.
> 
>
> Key: HDFS-17140
> URL: https://issues.apache.org/jira/browse/HDFS-17140
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Liangjun He
>Assignee: Liangjun He
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The current BPOfferService.reportBadBlocks() method can be optimized by 
> moving the creation of the rbbAction object outside the loop.
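
The hoist reads roughly as follows (hypothetical simplified types that follow the names in the summary, not the exact BPOfferService code): the action is identical for every actor, so it is created once outside the loop and the same instance is enqueued for each.
{code:java}
import java.util.List;

class ReportBadBlocksSketch {
  static class ReportBadBlockAction {
    ReportBadBlockAction(String block) { /* carries the bad-block details */ }
  }

  interface Actor {
    void enqueue(ReportBadBlockAction action);
  }

  void reportBadBlocks(String block, List<Actor> bpServices) {
    // Previously a new ReportBadBlockAction was built on every loop iteration;
    // creating it once avoids the redundant allocations.
    ReportBadBlockAction rbbAction = new ReportBadBlockAction(block);
    for (Actor actor : bpServices) {
      actor.enqueue(rbbAction);
    }
  }
}
{code}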






[jira] [Updated] (HDFS-17140) Revisit the BPOfferService.reportBadBlocks() method.

2023-09-06 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-17140:
---
Summary: Revisit the BPOfferService.reportBadBlocks() method.  (was: 
Optimize the BPOfferService.reportBadBlocks() method)

> Revisit the BPOfferService.reportBadBlocks() method.
> 
>
> Key: HDFS-17140
> URL: https://issues.apache.org/jira/browse/HDFS-17140
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Liangjun He
>Assignee: Liangjun He
>Priority: Minor
>  Labels: pull-request-available
>
> The current BPOfferService.reportBadBlocks() method can be optimized by 
> moving the creation of the rbbAction object outside the loop.






[jira] [Commented] (HDFS-17175) setOwner maybe set any user due to HDFS-16798

2023-08-31 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761148#comment-17761148
 ] 

Xiaoqiao He commented on HDFS-17175:


Thanks [~Linwood] for your report. It looks correct at first glance. Would you 
mind adding a unit test to cover this case? cc [~xuzq_zander], please give it 
another look to confirm. Thanks.

> setOwner maybe set any user due to HDFS-16798
> -
>
> Key: HDFS-17175
> URL: https://issues.apache.org/jira/browse/HDFS-17175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: TangLin
>Priority: Major
>
> This MR may cause the mapping between t2i and i2t to become inconsistent. We 
> need to revert it.






[jira] [Commented] (HDFS-17241) long write lock on active NN from rollEditLog()

2023-10-29 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780797#comment-17780797
 ] 

Xiaoqiao He commented on HDFS-17241:


Hi [~shuaiqi.guo], thanks for your report and contribution. Would you mind 
submitting a PR via GitHub? Please follow this guide: 
https://cwiki.apache.org/confluence/display/HADOOP/GitHub+Integration Thanks.

> long write lock on active NN from rollEditLog()
> ---
>
> Key: HDFS-17241
> URL: https://issues.apache.org/jira/browse/HDFS-17241
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.1.2
>Reporter: shuaiqi.guo
>Priority: Major
> Attachments: HDFS-17241.patch
>
>
> When the standby NN triggers a log roll on the active NN while also sending an fsimage 
> to the active NN, the active NN holds a long write lock, which 
> blocks almost all requests. For example:
> {code:java}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write 
> lock held for 27179 ms via java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:273)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:235)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1617)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4663)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1292)
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:146)
> org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12974)
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> java.security.AccessController.doPrivileged(Native Method)
> javax.security.auth.Subject.doAs(Subject.java:422)
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
>  {code}






[jira] [Resolved] (HDFS-17231) HA: Safemode should exit when resources are from low to available

2023-10-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17231.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> HA: Safemode should exit when resources are from low to available
> -
>
> Key: HDFS-17231
> URL: https://issues.apache.org/jira/browse/HDFS-17231
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 3.3.4, 3.3.6
>Reporter: kuper
>Assignee: kuper
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: 企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png
>
>
> The NameNodeResourceMonitor automatically enters safe mode when it detects 
> that the resources are not sufficient. When zkfc detects insufficient 
> resources, it triggers failover. Consider the following scenario:
>  * Initially, nn01 is active and nn02 is standby. Due to insufficient 
> resources in dfs.namenode.name.dir, the NameNodeResourceMonitor detects the 
> resource issue and puts nn01 into safemode. Subsequently, zkfc triggers 
> failover.
>  * At this point, nn01 is in safemode (ON) and standby, while nn02 is in 
> safemode (OFF) and active.
>  * After a period of time, the resources in nn01's dfs.namenode.name.dir 
> recover, causing a slight instability and triggering failover again.
>  * Now, nn01 is in safe mode (ON) and active, while nn02 is in safe mode 
> (OFF) and standby.
>  * However, since nn01 is active but in safemode (ON), hdfs cannot be read 
> from or written to.
> !企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png!
> *reproduction*
>  # Increase dfs.namenode.resource.du.reserved.
>  # Increase ha.health-monitor.check-interval.ms to avoid switching to standby 
> (and stopping the NameNodeResourceMonitor thread) immediately. 
> Instead, wait for the NameNodeResourceMonitor to enter 
> safe mode before the switch to standby happens.
>  # On the active node nn01, use the dd command to create a file that 
> exceeds the threshold, triggering a low-available-disk-space condition. 
>  # If the nn01 NameNode process does not die, nn01 ends up in safemode 
> (ON) and standby.






[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR

2023-11-06 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783162#comment-17783162
 ] 

Xiaoqiao He commented on HDFS-16016:


[~yuanbo] Thanks for involving me here again. IIRC, we have mentioned several times 
that this is a great improvement, but we have to keep the correct ordering when 
processing IBR/FBR asynchronously. 

{quote}The heartbeat thread not only handles FBR but also dispatches block-deletion 
commands. When the disks of the DN are quite busy, the whole execution time 
of the heartbeat thread can be longer than one minute, and it means files cannot 
be closed in time{quote}

Block invalidation is already asynchronous by default; would you mind providing a 
stack trace to show where it blocks now? Thanks.

> BPServiceActor add a new thread to handle IBR
> -
>
> Key: HDFS-16016
> URL: https://issues.apache.org/jira/browse/HDFS-16016
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
> Attachments: image-2023-11-03-18-11-54-502.png, 
> image-2023-11-06-10-53-13-584.png, image-2023-11-06-10-55-50-939.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now BPServiceActor#offerService() is doing many things: FBR, IBR, and heartbeats. 
> We can handle IBR independently to improve the performance of heartbeats and 
> FBR.
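
A schematic of the split (hypothetical Report type, not BPServiceActor): a dedicated daemon thread drains queued incremental block reports so a slow IBR cannot delay heartbeats, while the FIFO queue preserves per-queue ordering, which matters for the IBR/FBR ordering concern raised in the comments above.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class IbrSenderSketch {
  interface Report { void sendTo(String namenode); }

  private final BlockingQueue<Report> pending = new LinkedBlockingQueue<>();
  private final String namenode;

  IbrSenderSketch(String namenode) { this.namenode = namenode; }

  void enqueue(Report r) {
    pending.add(r);
  }

  Thread start() {
    Thread t = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          pending.take().sendTo(namenode);   // FIFO preserves report order
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }, "ibr-sender");
    t.setDaemon(true);
    t.start();
    return t;
  }
}
{code}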






[jira] [Resolved] (HDFS-17190) EC: Fix bug of OIV processing XAttr.

2023-09-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17190.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
 Assignee: Shuyan Zhang
   Resolution: Fixed

> EC: Fix bug of OIV processing XAttr.
> 
>
> Key: HDFS-17190
> URL: https://issues.apache.org/jira/browse/HDFS-17190
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When we need to use OIV to print EC information for a directory, 
> `PBImageTextWriter#getErasureCodingPolicyName` will be called. Currently, 
> this method uses `XATTR_ERASURECODING_POLICY.contains(xattr.getName())` to 
> filter and obtain EC XAttr, which is very dangerous. If we have an XAttr 
> whose name happens to be a substring of `hdfs.erasurecoding.policy`, then 
> `getErasureCodingPolicyName` will return the wrong result. Our internal 
> production environment has some customized XAttrs, and this bug caused errors 
> in the OIV parsing results when using the `-ec` option. 
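
The safer comparison amounts to an exact-name match (the constant value below is taken from this description; the helper itself is illustrative):
{code:java}
class EcXAttrNameCheck {
  static final String EC_POLICY_XATTR = "hdfs.erasurecoding.policy";

  static boolean isEcPolicyXAttr(String xattrName) {
    // contains(xattrName) would also accept substrings such as "coding.policy"
    // or "policy"; equals() only accepts the real EC policy XAttr name.
    return EC_POLICY_XATTR.equals(xattrName);
  }
}
{code}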






[jira] [Assigned] (HDFS-17197) Show file replication when listing corrupt files.

2023-09-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-17197:
--

Assignee: Shuyan Zhang

> Show file replication when listing corrupt files.
> -
>
> Key: HDFS-17197
> URL: https://issues.apache.org/jira/browse/HDFS-17197
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> Files with different replication have different reliability guarantees. We 
> need to pay attention to corrupted files with a specified replication greater 
> than or equal to 3. So, when listing corrupt files, it would be useful to 
> display the corresponding replication of the files.






[jira] [Resolved] (HDFS-17197) Show file replication when listing corrupt files.

2023-09-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17197.

Fix Version/s: 3.4.0
 Hadoop Flags: Incompatible change,Reviewed
   Resolution: Fixed

> Show file replication when listing corrupt files.
> -
>
> Key: HDFS-17197
> URL: https://issues.apache.org/jira/browse/HDFS-17197
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Files with different replication have different reliability guarantees. We 
> need to pay attention to corrupted files with a specified replication greater 
> than or equal to 3. So, when listing corrupt files, it would be useful to 
> display the corresponding replication of the files.






[jira] [Resolved] (HDFS-17105) mistakenly purge editLogs even after it is empty in NNStorageRetentionManager

2023-09-19 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17105.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
 Assignee: ConfX
   Resolution: Fixed

>  mistakenly purge editLogs even after it is empty in NNStorageRetentionManager
> --
>
> Key: HDFS-17105
> URL: https://issues.apache.org/jira/browse/HDFS-17105
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ConfX
>Assignee: ConfX
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: reproduce.sh
>
>
> h2. What happened:
> Got {{IndexOutOfBoundsException}} after setting 
> {{dfs.namenode.max.extra.edits.segments.retained}} to a negative value and 
> purging old record with {{{}NNStorageRetentionManager{}}}.
> h2. Where's the bug:
> In line 156 of {{{}NNStorageRetentionManager{}}}, the manager trims 
> {{editLogs}} until it is under the {{{}maxExtraEditsSegmentsToRetain{}}}:
> {noformat}
> while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
>       purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
>       editLogs.remove(0);
> }{noformat}
> However, if {{dfs.namenode.max.extra.edits.segments.retained}} is set 
> below 0, the size of {{editLogs}} can never drop below it, so the loop 
> eventually reaches {{editLogs.size()=0}} and {{editLogs.get(0)}} is out of range.
> h2. How to reproduce:
> (1) Set {{dfs.namenode.max.extra.edits.segments.retained}} to -1974676133
> (2) Run test: 
> {{org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager#testNoLogs}}
> h2. Stacktrace:
> {noformat}
> java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
>     at 
> java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
>     at 
> java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
>     at 
> java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
>     at java.base/java.util.Objects.checkIndex(Objects.java:372)
>     at java.base/java.util.ArrayList.get(ArrayList.java:459)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:157)
>     at 
> org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager.runTest(TestNNStorageRetentionManager.java:299)
>     at 
> org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager.testNoLogs(TestNNStorageRetentionManager.java:143){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.
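
A sketch of the kind of input validation that would avoid the out-of-range access (key name from this report; the clamp-to-zero policy and the default value are assumptions, not the committed patch):
{code:java}
import org.apache.hadoop.conf.Configuration;

class ExtraEditsSegmentsConfig {
  static final String KEY = "dfs.namenode.max.extra.edits.segments.retained";
  static final int ASSUMED_DEFAULT = 10000;

  static int read(Configuration conf) {
    int value = conf.getInt(KEY, ASSUMED_DEFAULT);
    if (value < 0) {
      // A negative retention can never be satisfied by the purge loop, so
      // treat it as "retain no extra segments" instead of failing later.
      return 0;
    }
    return value;
  }
}
{code}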






[jira] [Updated] (HDFS-17105) mistakenly purge editLogs even after it is empty in NNStorageRetentionManager

2023-09-19 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-17105:
---
Priority: Minor  (was: Critical)

>  mistakenly purge editLogs even after it is empty in NNStorageRetentionManager
> --
>
> Key: HDFS-17105
> URL: https://issues.apache.org/jira/browse/HDFS-17105
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ConfX
>Assignee: ConfX
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: reproduce.sh
>
>
> h2. What happened:
> Got an {{IndexOutOfBoundsException}} after setting 
> {{dfs.namenode.max.extra.edits.segments.retained}} to a negative value and 
> purging old records with {{NNStorageRetentionManager}}.
> h2. Where's the bug:
> In line 156 of {{{}NNStorageRetentionManager{}}}, the manager trims 
> {{editLogs}} until it is under the {{{}maxExtraEditsSegmentsToRetain{}}}:
> {noformat}
> while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
>       purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
>       editLogs.remove(0);
> }{noformat}
> However, if {{dfs.namenode.max.extra.edits.segments.retained}} is set below 0, the 
> size of {{editLogs}} can never drop below the threshold, so the loop keeps removing 
> entries until {{editLogs.size()}} is 0 and {{editLogs.get(0)}} is out of range.
> h2. How to reproduce:
> (1) Set {{dfs.namenode.max.extra.edits.segments.retained}} to -1974676133
> (2) Run test: 
> {{org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager#testNoLogs}}
> h2. Stacktrace:
> {noformat}
> java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
>     at 
> java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
>     at 
> java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
>     at 
> java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
>     at java.base/java.util.Objects.checkIndex(Objects.java:372)
>     at java.base/java.util.ArrayList.get(ArrayList.java:459)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:157)
>     at 
> org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager.runTest(TestNNStorageRetentionManager.java:299)
>     at 
> org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager.testNoLogs(TestNNStorageRetentionManager.java:143){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17192) Add block info when constructing remote block reader meets IOException

2023-09-18 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17192.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add block info when constructing remote block reader meets IOException
> -
>
> Key: HDFS-17192
> URL: https://issues.apache.org/jira/browse/HDFS-17192
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Currently, when constructing a remote block reader meets an IOException, the block 
> info is not logged. We should add it to make troubleshooting easier.
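> A rough sketch of the intended logging, assuming the catch site in the client's 
> block reader construction path looks roughly like this (the method and variable 
> names below are placeholders, not the actual client code):
> {code:java}
> // Illustrative sketch: include the block and datanode in the warning so a
> // failed reader construction can be tied to a concrete replica.
> try {
>   return newRemoteBlockReader();   // placeholder for the actual construction call
> } catch (IOException ioe) {
>   LOG.warn("I/O error constructing remote block reader for block " + block
>       + " from datanode " + datanodeInfo, ioe);
>   throw ioe;
> }
> {code}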



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17184) Improve BlockReceiver to throw DiskOutOfSpaceException when initializing

2023-09-21 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-17184:
---
Summary: Improve BlockReceiver to throw DiskOutOfSpaceException when 
initializing  (was: In BlockReceiver constructor method need throw 
DiskOutOfSpaceException)

> Improve BlockReceiver to throw DiskOutOfSpaceException when initializing
> ---
>
> Key: HDFS-17184
> URL: https://issues.apache.org/jira/browse/HDFS-17184
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>
> The BlockReceiver class receives a block and writes it to its disk.
> In the constructor, createTemporary and createRbw call chooseVolume, and 
> chooseVolume may throw a DiskOutOfSpaceException.
> In the current processing logic, if that exception occurs it is caught by the 
> catch(IOException ioe) block around line 282 of BlockReceiver.java, and 
> cleanupBlock() is executed there.
> Since the replica of the current block has not yet been added to the ReplicaMap, 
> executing cleanupBlock() throws a ReplicaNotFoundException.
> That ReplicaNotFoundException overwrites the actual DiskOutOfSpaceException, 
> so the reported exception information is inaccurate.
> {code:java}
> BlockReceiver(final ExtendedBlock block, final StorageType storageType,
>   final DataInputStream in,
>   final String inAddr, final String myAddr,
>   final BlockConstructionStage stage, 
>   final long newGs, final long minBytesRcvd, final long maxBytesRcvd, 
>   final String clientname, final DatanodeInfo srcDataNode,
>   final DataNode datanode, DataChecksum requestedChecksum,
>   CachingStrategy cachingStrategy,
>   final boolean allowLazyPersist,
>   final boolean pinning,
>   final String storageId) throws IOException {
> try{
>   ...
>  } catch (ReplicaAlreadyExistsException bae) {
>throw bae;
>  } catch (ReplicaNotFoundException bne) {
>throw bne;
>  } catch(IOException ioe) {
>   if (replicaInfo != null) {
> replicaInfo.releaseAllBytesReserved();
>   }
>   IOUtils.closeStream(this); 
>   cleanupBlock(); // if the replica is not in ReplicaMap, this throws 
> ReplicaNotFoundException
>   ...
>   throw ioe;
> }
>   }
> {code}
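> One possible shape of the improvement, as a sketch only: rethrow the 
> DiskOutOfSpaceException before the generic IOException handler runs cleanupBlock(), 
> so the original cause is not masked (the surrounding handlers are copied from the 
> snippet above):
> {code:java}
>  } catch (ReplicaAlreadyExistsException bae) {
>    throw bae;
>  } catch (ReplicaNotFoundException bne) {
>    throw bne;
>  } catch (DiskOutOfSpaceException dose) {
>    // chooseVolume failed before the replica was added to ReplicaMap; skip
>    // cleanupBlock() so this original cause is not replaced by a
>    // ReplicaNotFoundException.
>    throw dose;
>  } catch (IOException ioe) {
>    ...
>  }
> {code}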



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17184) Improve BlockReceiver to throw DiskOutOfSpaceException when initializing

2023-09-21 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17184.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Improve BlockReceiver to throw DiskOutOfSpaceException when initializing
> ---
>
> Key: HDFS-17184
> URL: https://issues.apache.org/jira/browse/HDFS-17184
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The BlockReceiver class receives a block and writes it to its disk.
> In the constructor, createTemporary and createRbw call chooseVolume, and 
> chooseVolume may throw a DiskOutOfSpaceException.
> In the current processing logic, if that exception occurs it is caught by the 
> catch(IOException ioe) block around line 282 of BlockReceiver.java, and 
> cleanupBlock() is executed there.
> Since the replica of the current block has not yet been added to the ReplicaMap, 
> executing cleanupBlock() throws a ReplicaNotFoundException.
> That ReplicaNotFoundException overwrites the actual DiskOutOfSpaceException, 
> so the reported exception information is inaccurate.
> {code:java}
> BlockReceiver(final ExtendedBlock block, final StorageType storageType,
>   final DataInputStream in,
>   final String inAddr, final String myAddr,
>   final BlockConstructionStage stage, 
>   final long newGs, final long minBytesRcvd, final long maxBytesRcvd, 
>   final String clientname, final DatanodeInfo srcDataNode,
>   final DataNode datanode, DataChecksum requestedChecksum,
>   CachingStrategy cachingStrategy,
>   final boolean allowLazyPersist,
>   final boolean pinning,
>   final String storageId) throws IOException {
> try{
>   ...
>  } catch (ReplicaAlreadyExistsException bae) {
>throw bae;
>  } catch (ReplicaNotFoundException bne) {
>throw bne;
>  } catch(IOException ioe) {
>   if (replicaInfo != null) {
> replicaInfo.releaseAllBytesReserved();
>   }
>   IOUtils.closeStream(this); 
>   cleanupBlock(); // if the replica is not in ReplicaMap, this throws 
> ReplicaNotFoundException
>   ...
>   throw ioe;
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17204) EC: Reduce unnecessary log when processing excess redundancy.

2023-09-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-17204:
---
Component/s: ec

> EC: Reduce unnecessary log when processing excess redundancy.
> -
>
> Key: HDFS-17204
> URL: https://issues.apache.org/jira/browse/HDFS-17204
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ec
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> This is a follow-up of 
> [HDFS-16964|https://issues.apache.org/jira/browse/HDFS-16964]. We now avoid 
> stale replicas when dealing with redundancy. This may result in redundant 
> replicas not being in the `nonExcess` set when we enter 
> `BlockManager#chooseExcessRedundancyStriped` (because the datanode where the 
> redundant replicas are located has not sent FBR yet, so those replicas are 
> filtered out and not added to the `nonExcess` set). A further result is that 
> no excess storage type is selected and the log "excess types chosen for 
> block..." is printed. When a failover occurs, a large number of datanodes 
> become stale, which causes NameNodes to print a large number of unnecessary 
> logs.
> This issue needs to be fixed, otherwise the performance after failover will 
> be affected.
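> A minimal sketch of the direction (the helper and variable names are illustrative, 
> not the committed patch): only log at info level when an excess target was actually 
> chosen, and keep the empty case at debug level so a failover does not flood the log:
> {code:java}
> List<DatanodeStorageInfo> chosen = chooseExcessTargets(nonExcess, storedBlock); // hypothetical helper
> if (chosen.isEmpty()) {
>   // After a failover many replicas are filtered out as stale, so nothing is
>   // chosen here; debug level avoids flooding the NameNode log.
>   LOG.debug("No excess replica chosen for block {}", storedBlock);
> } else {
>   LOG.info("Excess types chosen for block {}: {}", storedBlock, chosen);
> }
> {code}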



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17204) EC: Reduce unnecessary log when processing excess redundancy.

2023-09-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17204.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Reduce unnecessary log when processing excess redundancy.
> -
>
> Key: HDFS-17204
> URL: https://issues.apache.org/jira/browse/HDFS-17204
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> This is a follow-up of 
> [HDFS-16964|https://issues.apache.org/jira/browse/HDFS-16964]. We now avoid 
> stale replicas when dealing with redundancy. This may result in redundant 
> replicas not being in the `nonExcess` set when we enter 
> `BlockManager#chooseExcessRedundancyStriped` (because the datanode where the 
> redundant replicas are located has not sent FBR yet, so those replicas are 
> filtered out and not added to the `nonExcess` set). A further result is that 
> no excess storage type is selected and the log "excess types chosen for 
> block..." is printed. When a failover occurs, a large number of datanodes 
> become stale, which causes NameNodes to print a large number of unnecessary 
> logs.
> This issue needs to be fixed, otherwise the performance after failover will 
> be affected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17218) NameNode should remove its excess blocks from the ExcessRedundancyMap When a DN registers

2023-10-08 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773095#comment-17773095
 ] 

Xiaoqiao He commented on HDFS-17218:


Thanks for your report.
{quote}it causes some blocks in the excess map in the namenode to be leaked and 
this will result in many blocks having more replicas then expected.
{quote}
You mentioned that the 'excess map will be leaked'. IIUC, after the DN restarts it 
will send an FBR, and the NN will then send Invalidate commands to the DN again; at 
the DN side, the deletions are executed when the commands are received, BRD is 
reported back to the NN, and the excess map entries are removed. (Even if the first 
FBR cannot be processed completely, the next round will do it.)
I am confused about why it would leak. Thanks.

> NameNode should remove its excess blocks from the ExcessRedundancyMap When a 
> DN registers
> -
>
> Key: HDFS-17218
> URL: https://issues.apache.org/jira/browse/HDFS-17218
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> We found that a DN will lose all of its pending DNA_INVALIDATE blocks if it 
> restarts.
> *Root cause*
> The DN performs deletions asynchronously, so it may have many pending deletion 
> blocks in memory.
> When the DN restarts, these queued deletions are lost. This leaks entries in the 
> excess map of the namenode and results in many blocks having more replicas than 
> expected.
> *Solution*
> Consider having the NameNode remove the DN's blocks from the 
> ExcessRedundancyMap when that DN registers.
> This approach ensures that when the DN's full block report is processed, 
> 'processExtraRedundancy' can be performed according to the actual state of the 
> blocks.
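> A sketch of the proposed direction (the hook and map method below are hypothetical 
> names for illustration, not the existing BlockManager API):
> {code:java}
> // Sketch: on DataNode registration, drop the NN-side excess bookkeeping for that
> // node, because any DNA_INVALIDATE work queued on the DN before its restart is
> // gone. The next full block report then re-runs processExtraRedundancy from the
> // actual replica state.
> void onDatanodeRegistration(DatanodeDescriptor node) {
>   excessRedundancyMap.removeAllForDatanode(node.getDatanodeUuid()); // hypothetical method
> }
> {code}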



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17217) Add lifeline RPC start up log when NameNode#startCommonServices

2023-10-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17217.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Marked Resolution/Fix Version and the Reviewed flag since the PR has been committed 
to trunk.
cc [~zhangshuyan].

> Add lifeline RPC start up  log when NameNode#startCommonServices
> 
>
> Key: HDFS-17217
> URL: https://issues.apache.org/jira/browse/HDFS-17217
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> If the lifeline RPC server is started in the NameNode, we should add a lifeline RPC 
> start-up log in NameNode#startCommonServices.
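> The intended change is roughly a one-liner next to where the lifeline server is 
> started (the field and accessor names here are assumptions for illustration):
> {code:java}
> if (lifelineRpcServer != null) {
>   lifelineRpcServer.start();
>   LOG.info("NameNode lifeline RPC up at: {}", lifelineRpcServer.getListenerAddress());
> }
> {code}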



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17208) Add the metrics PendingAsyncDiskOperations in datanode

2023-10-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17208.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add the metrics PendingAsyncDiskOperations  in datanode 
> 
>
> Key: HDFS-17208
> URL: https://issues.apache.org/jira/browse/HDFS-17208
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Consider adding the metric `PendingAsyncDiskOperations` to track whether too many 
> asynchronous disk operations are queued in the FsDatasetAsyncDiskService of the 
> datanode.
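> One way such a metric could be exposed, sketched with an assumed accessor on the 
> async disk service (illustration only):
> {code:java}
> @Metric("Number of queued asynchronous disk operations")
> public long getPendingAsyncDiskOperations() {
>   // Assumed accessor: total tasks still waiting in the per-volume thread pools.
>   return asyncDiskService.countPendingDeletions();
> }
> {code}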



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17231) HA: Safemode should exit when resources are from low to available

2023-10-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778520#comment-17778520
 ] 

Xiaoqiao He commented on HDFS-17231:


Add [~kuper] to contributor list and assign this JIRA to him/her.

> HA: Safemode should exit when resources are from low to available
> -
>
> Key: HDFS-17231
> URL: https://issues.apache.org/jira/browse/HDFS-17231
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 3.3.4, 3.3.6
>Reporter: kuper
>Assignee: kuper
>Priority: Major
>  Labels: pull-request-available
> Attachments: 企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png
>
>
> The NameNodeResourceMonitor automatically enters safe mode when it detects 
> that the resources are not sufficient. When zkfc detects insufficient 
> resources, it triggers failover. Consider the following scenario:
>  * Initially, nn01 is active and nn02 is standby. Due to insufficient 
> resources in dfs.namenode.name.dir, the NameNodeResourceMonitor detects the 
> resource issue and puts nn01 into safemode. Subsequently, zkfc triggers 
> failover.
>  * At this point, nn01 is in safemode (ON) and standby, while nn02 is in 
> safemode (OFF) and active.
>  * After a period of time, the resources in nn01's dfs.namenode.name.dir 
> recover, causing a slight instability and triggering failover again.
>  * Now, nn01 is in safe mode (ON) and active, while nn02 is in safe mode 
> (OFF) and standby.
>  * However, since nn01 is active but in safemode (ON), hdfs cannot be read 
> from or written to.
> !企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png!
> *reproduction*
> # Increase dfs.namenode.resource.du.reserved.
>  # Increase ha.health-monitor.check-interval.ms so that the NameNode does not 
> immediately switch to standby and stop the NameNodeResourceMonitor thread; 
> instead, the NameNodeResourceMonitor has time to enter safe mode before the 
> switch to standby.
>  # On the active node nn01, use the dd command to create a file that exceeds 
> the threshold, triggering a low-available-disk-space condition.
>  # If the nn01 namenode process is not dead, nn01 ends up in safemode (ON) 
> and standby.
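> A simplified sketch of the behaviour this report asks for, written against a 
> placeholder interface instead of the real FSNamesystem calls (plain Java, 
> illustration only): the monitor should leave safe mode again once it entered it 
> purely because resources were low:
> {code:java}
> interface NameNodeResources {
>   boolean hasAvailableDiskSpace();
>   void enterSafeMode();
>   void leaveSafeMode();
> }
> 
> class ResourceMonitorSketch implements Runnable {
>   private final NameNodeResources nn;
>   private boolean inSafeModeForLowResources = false;
> 
>   ResourceMonitorSketch(NameNodeResources nn) { this.nn = nn; }
> 
>   @Override
>   public void run() {
>     if (!nn.hasAvailableDiskSpace()) {
>       if (!inSafeModeForLowResources) {
>         nn.enterSafeMode();                 // resources low: stop mutations
>         inSafeModeForLowResources = true;
>       }
>     } else if (inSafeModeForLowResources) {
>       nn.leaveSafeMode();                   // resources recovered: exit safe mode
>       inSafeModeForLowResources = false;
>     }
>   }
> }
> {code}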



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17231) HA: Safemode should exit when resources are from low to available

2023-10-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-17231:
--

Assignee: kuper

> HA: Safemode should exit when resources are from low to available
> -
>
> Key: HDFS-17231
> URL: https://issues.apache.org/jira/browse/HDFS-17231
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 3.3.4, 3.3.6
>Reporter: kuper
>Assignee: kuper
>Priority: Major
>  Labels: pull-request-available
> Attachments: 企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png
>
>
> The NameNodeResourceMonitor automatically enters safe mode when it detects 
> that the resources are not sufficient. When zkfc detects insufficient 
> resources, it triggers failover. Consider the following scenario:
>  * Initially, nn01 is active and nn02 is standby. Due to insufficient 
> resources in dfs.namenode.name.dir, the NameNodeResourceMonitor detects the 
> resource issue and puts nn01 into safemode. Subsequently, zkfc triggers 
> failover.
>  * At this point, nn01 is in safemode (ON) and standby, while nn02 is in 
> safemode (OFF) and active.
>  * After a period of time, the resources in nn01's dfs.namenode.name.dir 
> recover, causing a slight instability and triggering failover again.
>  * Now, nn01 is in safe mode (ON) and active, while nn02 is in safe mode 
> (OFF) and standby.
>  * However, since nn01 is active but in safemode (ON), hdfs cannot be read 
> from or written to.
> !企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png!
> *reproduction*
> # Increase dfs.namenode.resource.du.reserved.
>  # Increase ha.health-monitor.check-interval.ms so that the NameNode does not 
> immediately switch to standby and stop the NameNodeResourceMonitor thread; 
> instead, the NameNodeResourceMonitor has time to enter safe mode before the 
> switch to standby.
>  # On the active node nn01, use the dd command to create a file that exceeds 
> the threshold, triggering a low-available-disk-space condition.
>  # If the nn01 namenode process is not dead, nn01 ends up in safemode (ON) 
> and standby.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17218) NameNode should remove its excess blocks from the ExcessRedundancyMap When a DN registers

2023-10-09 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773523#comment-17773523
 ] 

Xiaoqiao He commented on HDFS-17218:


{quote}6. dn restarts will FBR is executed (processFirstBlockReport will not be 
executed here, processReport will be executed). since block1 is not a new 
block, the processExtraRedundancy logic will not be executed.
7. so the block of dn1 will always exist in excessRedundancyMap (until HA 
switch is performed).{quote}

Would you mind attaching a code snippet that shows the current implementation of 
steps 6 & 7? Thanks.

> NameNode should remove its excess blocks from the ExcessRedundancyMap When a 
> DN registers
> -
>
> Key: HDFS-17218
> URL: https://issues.apache.org/jira/browse/HDFS-17218
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> We found that a DN will lose all of its pending DNA_INVALIDATE blocks if it 
> restarts.
> *Root cause*
> The DN performs deletions asynchronously, so it may have many pending deletion 
> blocks in memory.
> When the DN restarts, these queued deletions are lost. This leaks entries in the 
> excess map of the namenode and results in many blocks having more replicas than 
> expected.
> *Solution*
> Consider having the NameNode remove the DN's blocks from the 
> ExcessRedundancyMap when that DN registers.
> This approach ensures that when the DN's full block report is processed, 
> 'processExtraRedundancy' can be performed according to the actual state of the 
> blocks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17151) EC: Fix wrong metadata in BlockInfoStriped after recovery

2023-08-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17151.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Fix wrong metadata in BlockInfoStriped after recovery
> -
>
> Key: HDFS-17151
> URL: https://issues.apache.org/jira/browse/HDFS-17151
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When the datanode completes a block recovery, it calls the 
> `commitBlockSynchronization` method to notify the NN of the new locations of the 
> block. For an EC block group, the NN determines the index of each internal block 
> based on the position of the DatanodeID in the parameter `newtargets`.
> If the internal blocks written by the client don't have continuous indices, 
> the current datanode code might cause NN to record incorrect block metadata. 
> For simplicity, let's take RS (3,2) as an example. The timeline of the 
> problem is as follows:
> 1. The client plans to write internal blocks with indices [0,1,2,3,4] to 
> datanode [dn0, dn1, dn2, dn3, dn4] respectively. But dn1 is unable to 
> connect, so the client only writes data to the remaining 4 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the content of `uc.getExpectedStorageLocations()` completely depends 
> on block reports, and it is [dn0, null, dn2, dn3, dn4];
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. Datanode that receives the recovery command fills `DatanodeID [] newLocs` 
> with [dn0, null, dn2, dn3, dn4];
> 7. The serialization process filters out null values, so the parameters 
> passed to NN become [dn0, dn2, dn3, dn4];
> 8. NN mistakenly believes that dn2 stores an internal block with index 1, dn3 
> stores an internal block with index 2, and so on.
> The above timeline is just an example, and there are other situations that 
> may result in the same error, such as an update pipeline occurs on the client 
> side. We should fix this bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17151) EC: Fix wrong metadata in BlockInfoStriped after recovery

2023-08-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-17151:
--

Assignee: Shuyan Zhang

> EC: Fix wrong metadata in BlockInfoStriped after recovery
> -
>
> Key: HDFS-17151
> URL: https://issues.apache.org/jira/browse/HDFS-17151
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> When the datanode completes a block recovery, it calls the 
> `commitBlockSynchronization` method to notify the NN of the new locations of the 
> block. For an EC block group, the NN determines the index of each internal block 
> based on the position of the DatanodeID in the parameter `newtargets`.
> If the internal blocks written by the client don't have continuous indices, 
> the current datanode code might cause NN to record incorrect block metadata. 
> For simplicity, let's take RS (3,2) as an example. The timeline of the 
> problem is as follows:
> 1. The client plans to write internal blocks with indices [0,1,2,3,4] to 
> datanode [dn0, dn1, dn2, dn3, dn4] respectively. But dn1 is unable to 
> connect, so the client only writes data to the remaining 4 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the content of `uc.getExpectedStorageLocations()` completely depends 
> on block reports, and it is [dn0, null, dn2, dn3, dn4];
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. Datanode that receives the recovery command fills `DatanodeID [] newLocs` 
> with [dn0, null, dn2, dn3, dn4];
> 7. The serialization process filters out null values, so the parameters 
> passed to NN become [dn0, dn2, dn3, dn4];
> 8. NN mistakenly believes that dn2 stores an internal block with index 1, dn3 
> stores an internal block with index 2, and so on.
> The above timeline is just an example, and there are other situations that 
> may result in the same error, such as an update pipeline occurs on the client 
> side. We should fix this bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17154) EC: Fix bug in updateBlockForPipeline after failover

2023-08-16 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17154.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Fix bug in updateBlockForPipeline after failover
> 
>
> Key: HDFS-17154
> URL: https://issues.apache.org/jira/browse/HDFS-17154
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In the method `updateBlockForPipeline`, NameNode uses the 
> `BlockUnderConstructionFeature` of a BlockInfo to generate the member 
> `blockIndices` of `LocatedStripedBlock`. 
> And then, NameNode uses `blockIndices` to generate block tokens for client.
> However, if there is a failover, the location info in 
> BlockUnderConstructionFeature may be incomplete, which results in the absence 
> of the corresponding block tokens.
> When the client receives these incomplete block tokens, it will throw an NPE 
> because `updatedBlks[i]` is null.
> NameNode should just return block tokens for all indices to the client. 
> Client can pick whichever it likes to use. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17093) Fix block report lease issue to avoid missing some storages report.

2023-08-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-17093:
---
Summary: Fix block report lease issue to avoid missing some storages 
report.  (was: In the case of all datanodes sending FBR when the namenode 
restarts (large clusters), there is an issue with incomplete block reporting)

> Fix block report lease issue to avoid missing some storages report.
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
>
> In our cluster of 800+ nodes, after restarting the namenode we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after the restart because of the incomplete block 
> reports.
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that: retry the send if it fails. But in the logic on 
> the namenode side:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having reported before, namely 
> storageInfo.getBlockReportCount() > 0, the lease is removed from the 
> datanode, which makes the second report fail because there is no lease.
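> One possible direction, sketched against the snippet above (an illustration of the 
> idea, not the committed patch): discard the report for a storage that has already 
> reported, but keep the datanode's lease, so the remaining storage reports of the 
> same FBR and a later retry still hold a valid lease:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   // Keep the lease: removing it here makes every later storage report of this
>   // FBR (and a retried FBR) fail with "no lease", which is the bug described above.
>   return !node.hasStaleStorages();
> }
> {code}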



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17093) Fix block report lease issue to avoid missing some storages report.

2023-08-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17093.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix block report lease issue to avoid missing some storages report.
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Assignee: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In our cluster of 800+ nodes, after restarting the namenode we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after the restart because of the incomplete block 
> reports.
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that: retry the send if it fails. But in the logic on 
> the namenode side:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having reported before, namely 
> storageInfo.getBlockReportCount() > 0, the lease is removed from the 
> datanode, which makes the second report fail because there is no lease.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17093) Fix block report lease issue to avoid missing some storages report.

2023-08-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-17093:
--

Assignee: Yanlei Yu

> Fix block report lease issue to avoid missing some storages report.
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Assignee: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
>
> In our cluster of 800+ nodes, after restarting the namenode we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after the restart because of the incomplete block 
> reports.
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that: retry the send if it fails. But in the logic on 
> the namenode side:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having reported before, namely 
> storageInfo.getBlockReportCount() > 0, the lease is removed from the 
> datanode, which makes the second report fail because there is no lease.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17149) getBlockLocations RPC should use actual client ip to compute network distance when using RBF.

2023-08-13 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753895#comment-17753895
 ] 

Xiaoqiao He commented on HDFS-17149:


[~zhanghaobo], Thanks. IIRC, we have discussed locality for RBF many times, and 
there is still no agreement for some cases, such as #getBlockLocations here. I think 
we should push it forward again to enhance RBF for production clusters. cc 
[~elgoiri], [~ayushtkn], what do you think?

> getBlockLocations RPC should use actual client ip to compute network distance 
> when using RBF.
> -
>
> Key: HDFS-17149
> URL: https://issues.apache.org/jira/browse/HDFS-17149
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> Please correct me if I understand this wrongly. Thanks.
> Currently, when a getBlockLocations RPC is forwarded to the namenode via a router, 
> the NameNode uses the router's IP address as the client machine to compute the 
> network distance against the block's locations. See the 
> FSNamesystem#sortLocatedBlocks method for more detailed information.
> I think this computation is not correct and should use the actual client IP.
>  
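> A rough sketch of one way the real client address could be recovered on the 
> NameNode side, assuming the Router appends a {{clientIp:<ip>}} tag to the caller 
> context (that tagging is an assumption of this sketch, not something the 
> description above guarantees):
> {code:java}
> String clientMachine = Server.getRemoteAddress();  // router address when the call is proxied
> CallerContext ctx = CallerContext.getCurrent();
> if (ctx != null && ctx.getContext() != null) {
>   for (String part : ctx.getContext().split(",")) {
>     if (part.startsWith("clientIp:")) {
>       // Use the original client's address for network-distance sorting.
>       clientMachine = part.substring("clientIp:".length());
>     }
>   }
> }
> {code}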



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17150.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-17150:
--

Assignee: Shuyan Zhang

> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16531) Avoid setReplication logging an edit record if old replication equals the new value

2022-04-20 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525408#comment-17525408
 ] 

Xiaoqiao He commented on HDFS-16531:


Thanks [~ayushtkn] for your kind information.
Based on the code segment you mentioned above, I am not sure why skipping a 
setReplication call with the same value could impact the snapshot feature. My concern 
is which side's implementation (replication or snapshot) is not behaving as expected. 
Thanks.
BTW, reverting this change is also the safest operation for me. I just want to dig 
into the root cause. :)

> Avoid setReplication logging an edit record if old replication equals the new 
> value
> ---
>
> Key: HDFS-16531
> URL: https://issues.apache.org/jira/browse/HDFS-16531
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I recently came across an NN log where about 800k setRep calls were made, 
> setting the replication from 3 to 3, i.e. leaving it unchanged.
> Even in a case like this, we log an edit record, an audit log, and perform 
> some quota checks etc.
> I believe it should be possible to avoid some of the work if we check for 
> oldRep == newRep and jump out of the method early.
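> The shape of the proposed short-circuit, as a simplified sketch (field and variable 
> names simplified, not the exact NameNode code):
> {code:java}
> final short oldRepl = file.getFileReplication();
> if (oldRepl == replication) {
>   // Nothing changes: skip the quota update, the edit log record and the
>   // block-manager notification for this no-op call.
>   return true;
> }
> {code}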



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16500) Make asynchronous block deletion lock and unlock duration threshold configurable

2022-04-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16500.

   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s:   (was: 3.3.1, 3.3.2)
  Resolution: Fixed

Committed to trunk. Thanks [~smarthan] for your contributions.

> Make asynchronous block deletion lock and unlock duration threshold 
> configurable 
> -
>
> Key: HDFS-16500
> URL: https://issues.apache.org/jira/browse/HDFS-16500
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chengwei Wang
>Assignee: Chengwei Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I have backported the nice feature HDFS-16043 to our internal branch, and it works 
> well in our testing cluster.
> I think it's better to make the fields *_deleteBlockLockTimeMs_* and 
> *_deleteBlockUnlockIntervalTimeMs_* configurable, so that we can control the 
> lock and unlock durations.
> {code:java}
> private final long deleteBlockLockTimeMs = 500;
> private final long deleteBlockUnlockIntervalTimeMs = 100;{code}
> And we should set the default values smaller to avoid blocking other requests 
> for a long time when deleting large directories.
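> A sketch of how the two thresholds could be wired to configuration (the key names 
> and defaults below are illustrative, not necessarily the ones finally committed):
> {code:java}
> // Illustrative key names; the committed patch may use different ones.
> public static final String DELETE_BLOCK_LOCK_TIME_MS_KEY =
>     "dfs.namenode.block.deletion.lock.threshold.ms";
> public static final String DELETE_BLOCK_UNLOCK_INTERVAL_MS_KEY =
>     "dfs.namenode.block.deletion.unlock.interval.ms";
> 
> // Read in the constructor instead of hard-coding 500 / 100:
> this.deleteBlockLockTimeMs = conf.getLong(DELETE_BLOCK_LOCK_TIME_MS_KEY, 500L);
> this.deleteBlockUnlockIntervalTimeMs =
>     conf.getLong(DELETE_BLOCK_UNLOCK_INTERVAL_MS_KEY, 100L);
> {code}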



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16531) Avoid setReplication logging an edit record if old replication equals the new value

2022-04-20 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525415#comment-17525415
 ] 

Xiaoqiao He commented on HDFS-16531:


[~hemanthboyina], Thanks very much for your reminder and the detailed report. 
It makes sense to me.

> Avoid setReplication logging an edit record if old replication equals the new 
> value
> ---
>
> Key: HDFS-16531
> URL: https://issues.apache.org/jira/browse/HDFS-16531
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I recently came across an NN log where about 800k setRep calls were made, 
> setting the replication from 3 to 3, i.e. leaving it unchanged.
> Even in a case like this, we log an edit record, an audit log, and perform 
> some quota checks etc.
> I believe it should be possible to avoid some of the work if we check for 
> oldRep == newRep and jump out of the method early.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16554) Remove unused configuration dfs.namenode.block.deletion.increment.

2022-04-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16554.

   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s:   (was: 3.4.0)
  Resolution: Fixed

Committed to trunk. Thanks [~smarthan] for your contribution!

> Remove unused configuration dfs.namenode.block.deletion.increment. 
> ---
>
> Key: HDFS-16554
> URL: https://issues.apache.org/jira/browse/HDFS-16554
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chengwei Wang
>Assignee: Chengwei Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The configuration *_dfs.namenode.block.deletion.increment_* is no longer used 
> after HDFS-16043, which performs block deletion asynchronously. So it's 
> better to remove it.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16554) Remove unused configuration dfs.namenode.block.deletion.increment.

2022-04-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16554:
---
Component/s: namenode

> Remove unused configuration dfs.namenode.block.deletion.increment. 
> ---
>
> Key: HDFS-16554
> URL: https://issues.apache.org/jira/browse/HDFS-16554
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chengwei Wang
>Assignee: Chengwei Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The configuration *_dfs.namenode.block.deletion.increment_* is no longer used 
> after HDFS-16043, which performs block deletion asynchronously. So it's 
> better to remove it.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2022-07-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561997#comment-17561997
 ] 

Xiaoqiao He commented on HDFS-15079:


[~xuzq_zander] Thanks for involving me here. I am not a fan of `CallerContext`, 
but considering this is a serious issue, I fully support pushing this solution 
forward to fix it first, then improving it if needed.
Actually, in my practice, I extended the RPC interface to allow a super user to 
get/set the ClientId and CallId for the Router; this solution is also applied to 
`data locality` and some other similar cases. It has worked fine for more than 
two years.

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Hui Fei
>Priority: Critical
> Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, 
> UnexpectedOverWriteUT.patch
>
>
>  I find there is a critical problem in RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about the overall resolution.
> The problem is that:
> A client with RBF (r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns the file that had already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs occasionally.
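> As a small illustration of why the default matters: a client that creates the file 
> with overwrite=false would make the late 1st create fail instead of silently 
> truncating the already-written file. This is only a client-side mitigation sketch, 
> not a fix for the RPC-ordering problem itself:
> {code:java}
> // overwrite = false: a stale duplicate create that arrives later fails with
> // FileAlreadyExistsException instead of replacing the written file with an empty one.
> try (FSDataOutputStream out = fs.create(path, false)) {
>   out.write(data);
> }
> {code}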



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16628) RBF: Correct target directory when move to trash for kerberos login user.

2022-06-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16628.

Hadoop Flags: Reviewed
Target Version/s: 3.4.0
  Resolution: Fixed

Committed to trunk. Thanks [~zhangxiping].

> RBF: Correct target directory when move to trash for kerberos login user.
> -
>
> Key: HDFS-16628
> URL: https://issues.apache.org/jira/browse/HDFS-16628
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Xiping Zhang
>Assignee: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Removing data via the router will fail for a user such as 
> username/d...@hadoop.com



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16628) RBF: Correct target directory when move to trash for kerberos login user.

2022-06-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-16628:
--

Assignee: Xiping Zhang

> RBF: Correct target directory when move to trash for kerberos login user.
> -
>
> Key: HDFS-16628
> URL: https://issues.apache.org/jira/browse/HDFS-16628
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Xiping Zhang
>Assignee: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Removing data via the router will fail for a user such as 
> username/d...@hadoop.com



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16628) RBF: Correct target directory when move to trash for kerberos login user.

2022-06-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16628:
---
Summary: RBF: Correct target directory when move to trash for kerberos 
login user.  (was: RBF: kerberos user remove Non-default namespace data failed)

> RBF: Correct target directory when move to trash for kerberos login user.
> -
>
> Key: HDFS-16628
> URL: https://issues.apache.org/jira/browse/HDFS-16628
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Removing data via the router will fail for a user such as 
> username/d...@hadoop.com



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16600) Fix deadlock of fine-grain lock for FsDatasetImpl of DataNode.

2022-06-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16600.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander] for your work!

> Fix deadlock of fine-grain lock for FsDatasetImpl of DataNode.
> -
>
> Key: HDFS-16600
> URL: https://issues.apache.org/jira/browse/HDFS-16600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> The UT 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction 
> failed because of a deadlock introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534]. 
> DeadLock:
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 
> need a read lock
> try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
> b.getBlockPoolId()))
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 
> 3526 need a write lock
> try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16600) Deadlock on DataNode

2022-06-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16600:
---
Parent: HDFS-15382
Issue Type: Sub-task  (was: Bug)

> Deadlock on DataNode
> 
>
> Key: HDFS-16600
> URL: https://issues.apache.org/jira/browse/HDFS-16600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> The UT 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction 
> failed because of a deadlock introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534]. 
> DeadLock:
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 
> need a read lock
> try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
> b.getBlockPoolId()))
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 
> 3526 need a write lock
> try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16600) Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.

2022-06-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555638#comment-17555638
 ] 

Xiaoqiao He commented on HDFS-16600:


[~xuzq_zander] BTW, do you deploy this feature on your prod cluster? If so, would 
you mind offering some performance results compared to running without this 
feature? Although it has been deployed on my internal cluster for over a year and 
works well, I believe the performance results could differ more or less across 
versions (my internal version is based on branch-2.7). Thanks again.

> Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.
> -
>
> Key: HDFS-16600
> URL: https://issues.apache.org/jira/browse/HDFS-16600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> The UT 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction 
> failed because of a deadlock introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534]. 
> DeadLock:
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 
> need a read lock
> try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
> b.getBlockPoolId()))
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 
> 3526 need a write lock
> try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16600) Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.

2022-06-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16600:
---
Summary: Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.  
(was: Deadlock on DataNode)

> Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.
> -
>
> Key: HDFS-16600
> URL: https://issues.apache.org/jira/browse/HDFS-16600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> The UT 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction 
> failed because of a deadlock introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534]. 
> DeadLock:
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 
> need a read lock
> try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
> b.getBlockPoolId()))
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 
> 3526 need a write lock
> try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16598) Fix DataNode FsDatasetImpl lock issue without GS checks.

2022-06-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16598:
---
Parent: HDFS-15382
Issue Type: Sub-task  (was: Bug)

> Fix DataNode FsDatasetImpl lock issue without GS checks.
> 
>
> Key: HDFS-16598
> URL: https://issues.apache.org/jira/browse/HDFS-16598
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> org.apache.hadoop.hdfs.testPipelineRecoveryOnRestartFailure failed with a 
> stack trace like:
> {code:java}
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
>  are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1667)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1601)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> After tracing the root cause, we found this bug was introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534], because the 
> client's block GS may be smaller than the DataNode's when pipeline recovery fails.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16598) Fix DataNode FsDatasetImpl lock issue without GS checks.

2022-06-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16598.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander] for your work!

> Fix DataNode FsDatasetImpl lock issue without GS checks.
> 
>
> Key: HDFS-16598
> URL: https://issues.apache.org/jira/browse/HDFS-16598
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> org.apache.hadoop.hdfs.testPipelineRecoveryOnRestartFailure failed with a 
> stack trace like:
> {code:java}
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
>  are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1667)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1601)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> After tracing the root cause, we found this bug was introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534], because the 
> client's block GS may be smaller than the DataNode's when pipeline recovery fails.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16598) Fix DataNode FsDatasetImpl lock issue without GS checks.

2022-06-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16598:
---
Summary: Fix DataNode FsDatasetImpl lock issue without GS checks.  (was: 
Fix DataNode Fs)

> Fix DataNode FsDatasetImpl lock issue without GS checks.
> 
>
> Key: HDFS-16598
> URL: https://issues.apache.org/jira/browse/HDFS-16598
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> org.apache.hadoop.hdfs.testPipelineRecoveryOnRestartFailure failed with a 
> stack trace like:
> {code:java}
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
>  are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1667)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1601)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> After tracing the root cause, we found this bug was introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534], because the 
> client's block GS may be smaller than the DataNode's when pipeline recovery fails.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16598) Fix DataNode Fs

2022-06-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16598:
---
Summary: Fix DataNode Fs  (was: All datanodes 
[DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
 are bad. Aborting...)

> Fix DataNode Fs
> ---
>
> Key: HDFS-16598
> URL: https://issues.apache.org/jira/browse/HDFS-16598
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> org.apache.hadoop.hdfs.testPipelineRecoveryOnRestartFailure failed with a 
> stack trace like:
> {code:java}
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
>  are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1667)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1601)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> After tracing the root cause, we found this bug was introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534], because the 
> client's block GS may be smaller than the DataNode's when pipeline recovery fails.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16627) Improve BPServiceActor#register log to add NameNode address

2022-06-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16627.

Hadoop Flags: Reviewed
  Resolution: Fixed

Committed to trunk. Thanks [~slfan1989] for your contributions.

> Improve BPServiceActor#register log to add NameNode address
> ---
>
> Key: HDFS-16627
> URL: https://issues.apache.org/jira/browse/HDFS-16627
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When reading the log, I think the NameNode address information should be added to 
> make the log more complete.
> The log is as follows:
> {code:java}
> 2022-06-06 06:15:32,715 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(819)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> beginning handshake with NN.
> 2022-06-06 06:15:32,717 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(847)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> successfully registered with NN. {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16627) Improve BPServiceActor#register log to add NameNode address

2022-06-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16627:
---
Summary: Improve BPServiceActor#register log to add NameNode address  (was: 
improve BPServiceActor#register Log Add NN Addr)

> Improve BPServiceActor#register log to add NameNode address
> ---
>
> Key: HDFS-16627
> URL: https://issues.apache.org/jira/browse/HDFS-16627
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When reading the log, I think the NameNode address information should be added to 
> make the log more complete.
> The log is as follows:
> {code:java}
> 2022-06-06 06:15:32,715 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(819)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> beginning handshake with NN.
> 2022-06-06 06:15:32,717 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(847)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> successfully registered with NN. {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16627) Improve BPServiceActor#register log to add NameNode address

2022-06-11 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553086#comment-17553086
 ] 

Xiaoqiao He commented on HDFS-16627:


[~slfan1989] BTW, the Fix Version label should not be set when filing a Jira. 
Generally it is set by the committer after the PR is checked in. Thanks again.

> Improve BPServiceActor#register log to add NameNode address
> ---
>
> Key: HDFS-16627
> URL: https://issues.apache.org/jira/browse/HDFS-16627
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When reading the log, I think the NameNode address information should be added to 
> make the log more complete.
> The log is as follows:
> {code:java}
> 2022-06-06 06:15:32,715 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(819)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> beginning handshake with NN.
> 2022-06-06 06:15:32,717 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(847)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> successfully registered with NN. {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16609) Fix Flakes Junit Tests that often report timeouts

2022-06-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16609.

Hadoop Flags: Reviewed
  Resolution: Fixed

Committed to trunk. Thanks [~slfan1989].

> Fix Flakes Junit Tests that often report timeouts
> -
>
> Key: HDFS-16609
> URL: https://issues.apache.org/jira/browse/HDFS-16609
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> While I was working on HDFS-16590, JUnit tests often reported errors. I 
> found that one class of failures is timeouts, and these failures can be 
> avoided by increasing the timeout values.
> The modified methods are as follows:
> 1.org.apache.hadoop.hdfs.TestFileCreation#testServerDefaultsWithMinimalCaching
> {code:java}
> [ERROR] 
> testServerDefaultsWithMinimalCaching(org.apache.hadoop.hdfs.TestFileCreation) 
>  Time elapsed: 7.136 s  <<< ERROR!
> java.util.concurrent.TimeoutException: 
> Timed out waiting for condition. 
> Thread diagnostics: 
> [WARNING] 
> org.apache.hadoop.hdfs.TestFileCreation.testServerDefaultsWithMinimalCaching(org.apache.hadoop.hdfs.TestFileCreation)
> [ERROR]   Run 1: TestFileCreation.testServerDefaultsWithMinimalCaching:277 
> Timeout Timed out ...
> [INFO]   Run 2: PASS{code}
> 2.org.apache.hadoop.hdfs.TestDFSShell#testFilePermissions
> {code:java}
> [ERROR] testFilePermissions(org.apache.hadoop.hdfs.TestDFSShell)  Time 
> elapsed: 30.022 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3 
> milliseconds
>   at java.lang.Thread.dumpThreads(Native Method)
>   at java.lang.Thread.getStackTrace(Thread.java:1549)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout.createTimeoutException(FailOnTimeout.java:182)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout.getResult(FailOnTimeout.java:177)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout.evaluate(FailOnTimeout.java:128)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> [WARNING] 
> org.apache.hadoop.hdfs.TestDFSShell.testFilePermissions(org.apache.hadoop.hdfs.TestDFSShell)
> [ERROR]   Run 1: TestDFSShell.testFilePermissions TestTimedOut test timed out 
> after 3 mil...
> [INFO]   Run 2: PASS {code}
> 3.org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier#testSPSWhenFileHasExcessRedundancyBlocks
> {code:java}
> [ERROR] 
> testSPSWhenFileHasExcessRedundancyBlocks(org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier)
>   Time elapsed: 67.904 s  <<< ERROR!
> java.util.concurrent.TimeoutException: 
> Timed out waiting for condition. 
> [WARNING] 
> org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier.testSPSWhenFileHasExcessRedundancyBlocks(org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier)
> [ERROR]   Run 1: 
> TestExternalStoragePolicySatisfier.testSPSWhenFileHasExcessRedundancyBlocks:1379
>  Timeout
> [ERROR]   Run 2: 
> TestExternalStoragePolicySatisfier.testSPSWhenFileHasExcessRedundancyBlocks:1379
>  Timeout
> [INFO]   Run 3: PASS {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16588) Backport HDFS-16584 to branch-3.3.

2022-05-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16588:
---
Summary: Backport HDFS-16584 to branch-3.3.  (was: Backport HDFS-16584 to 
branch-3.3 and other active old branches)

> Backport HDFS-16584 to branch-3.3.
> --
>
> Key: HDFS-16588
> URL: https://issues.apache.org/jira/browse/HDFS-16588
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This issue has been dealt with in trunk and now needs to be backported to 
> branch-3.3 or other active branches.
> See HDFS-16584.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16588) Backport HDFS-16584 to branch-3.3.

2022-05-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16588.

Fix Version/s: 3.3.4
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to branch-3.3

> Backport HDFS-16584 to branch-3.3.
> --
>
> Key: HDFS-16588
> URL: https://issues.apache.org/jira/browse/HDFS-16588
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.4
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This issue has been dealt with in trunk and now needs to be backported to 
> branch-3.3 or other active branches.
> See HDFS-16584.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16584) Record StandbyNameNode information when Balancer is running

2022-05-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16584.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~jianghuazhu] for your contribution!

> Record StandbyNameNode information when Balancer is running
> ---
>
> Key: HDFS-16584
> URL: https://issues.apache.org/jira/browse/HDFS-16584
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2022-05-19-20-23-23-825.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When the Balancer is running, we allow block data to be fetched from the 
> StandbyNameNode, which is nice. Here are some logs:
>  !image-2022-05-19-20-23-23-825.png! 
> But we have no way of knowing which NameNode the request was made to. We 
> should log more detailed information, such as the host associated with the 
> StandbyNameNode.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16657) Changing pool-level lock to volume-level lock for invalidation of blocks

2022-07-12 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566132#comment-17566132
 ] 

Xiaoqiao He commented on HDFS-16657:


[~yuanbo] Thanks for your proposal. IIRC, we have discussed this issue for a 
while. I think it is time to improve it.
For further improvement, we should consider the cost of acquiring the 
volume-level lock, for instance when processing commands to invalidate blocks; 
it may be possible to batch them and reduce how frequently the lock is acquired. 
I am not sure whether there are other cases that block the heartbeat or other 
flows.
Anyway, would you like to contribute and improve it? Thanks again.
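
To illustrate the batching idea above, here is a minimal sketch. The class and 
method names (InvalidationBatcher, BlockMeta, markDeleted, deleteFileAsync) are 
illustrative assumptions, not the actual FsDatasetImpl API: group the blocks to 
invalidate by volume, take each volume-level lock once per batch for the cheap 
in-memory updates, and keep the disk I/O outside the lock.

{code:java}
// Illustrative sketch only: batch invalidations per volume so the volume-level
// lock is acquired once per batch instead of once per block, and disk I/O
// (file deletion) stays outside the lock. Types below are assumed placeholders.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class InvalidationBatcher {
  interface BlockMeta {
    String volumeId();
    void markDeleted();       // cheap in-memory state change
    void deleteFileAsync();   // schedules the actual disk deletion
  }

  private final Map<String, ReentrantReadWriteLock> volumeLocks = new HashMap<>();

  void invalidate(List<BlockMeta> blocks) {
    // 1. Group the blocks to invalidate by the volume that stores them.
    Map<String, List<BlockMeta>> byVolume = new HashMap<>();
    for (BlockMeta b : blocks) {
      byVolume.computeIfAbsent(b.volumeId(), k -> new ArrayList<>()).add(b);
    }
    // 2. Acquire each volume-level lock once per batch.
    for (Map.Entry<String, List<BlockMeta>> e : byVolume.entrySet()) {
      ReentrantReadWriteLock lock = volumeLocks
          .computeIfAbsent(e.getKey(), k -> new ReentrantReadWriteLock());
      lock.writeLock().lock();
      try {
        for (BlockMeta b : e.getValue()) {
          b.markDeleted();
        }
      } finally {
        lock.writeLock().unlock();
      }
      // 3. Perform the slow disk I/O after releasing the lock.
      for (BlockMeta b : e.getValue()) {
        b.deleteFileAsync();
      }
    }
  }
}
{code}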

> Changing pool-level lock to volume-level lock for invalidation of blocks
> 
>
> Key: HDFS-16657
> URL: https://issues.apache.org/jira/browse/HDFS-16657
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Yuanbo Liu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-07-13-10-25-37-383.png, 
> image-2022-07-13-10-27-01-386.png, image-2022-07-13-10-27-44-258.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Recently we have seen DataNode heartbeats become slow in a very busy 
> cluster; here is the chart:
> !image-2022-07-13-10-25-37-383.png|width=665,height=245!
>  
> After getting a jstack of the DataNode, we found its heartbeat stuck in 
> the invalidation of blocks:
> !image-2022-07-13-10-27-01-386.png|width=658,height=308!
> !image-2022-07-13-10-27-44-258.png|width=502,height=325!
> The key code is:
> {code:java}
> // code placeholder
> try {
>   File blockFile = new File(info.getBlockURI());
>   if (blockFile != null && blockFile.getParentFile() == null) {
> errors.add("Failed to delete replica " + invalidBlks[i]
> +  ". Parent not found for block file: " + blockFile);
> continue;
>   }
> } catch(IllegalArgumentException e) {
>   LOG.warn("Parent directory check failed; replica " + info
>   + " is not backed by a local file");
> } {code}
> The DN is trying to locate the parent path of the block file, so there is disk 
> I/O inside the pool-level lock. When the disk becomes very busy with high I/O 
> wait, all the pending threads are blocked by the pool-level lock and the 
> heartbeat time becomes high. We propose to change the pool-level lock to a 
> volume-level lock for block invalidation.
> cc: [~hexiaoqiao] [~Aiphag0] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16655) OIV: print out erasure coding policy name in oiv Delimited output

2022-07-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16655.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~max2049] for your contributions.

> OIV: print out erasure coding policy name in oiv Delimited output
> -
>
> Key: HDFS-16655
> URL: https://issues.apache.org/jira/browse/HDFS-16655
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Adding the erasure coding policy name to the oiv output will help oiv 
> post-analysis to get an overview of all folders/files with a specified EC 
> policy and to apply internal regulation based on this information. In 
> particular, it will be convenient for the platform to calculate the real 
> storage size of EC files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16655) OIV: print out erasure coding policy name in oiv Delimited output

2022-07-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16655:
---
Component/s: erasure-coding
 (was: tools)

> OIV: print out erasure coding policy name in oiv Delimited output
> -
>
> Key: HDFS-16655
> URL: https://issues.apache.org/jira/browse/HDFS-16655
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Adding the erasure coding policy name to the oiv output will help oiv 
> post-analysis to get an overview of all folders/files with a specified EC 
> policy and to apply internal regulation based on this information. In 
> particular, it will be convenient for the platform to calculate the real 
> storage size of EC files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16658) BlockManager should output some logs when logEveryBlock is true.

2022-07-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16658.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander].

> BlockManager should output some logs when logEveryBlock is true.
> 
>
> Key: HDFS-16658
> URL: https://issues.apache.org/jira/browse/HDFS-16658
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> While locating some abnormal cases in our prod environment, I found that 
> BlockManager does not output some logs in `addStoredBlock` even though 
> `logEveryBlock` is true.
> I feel that we need to change the log level from DEBUG to INFO.
> {code:java}
> // Some comments here
> private Block addStoredBlock(final BlockInfo block,
>final Block reportedBlock,
>DatanodeStorageInfo storageInfo,
>DatanodeDescriptor delNodeHint,
>boolean logEveryBlock)
>   throws IOException {
> 
>   if (logEveryBlock) {
> blockLog.debug("BLOCK* addStoredBlock: {} is added to {} (size={})",
> node, storedBlock, storedBlock.getNumBytes());
>   }
> ...
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16534) Split datanode block pool locks to volume grain.

2022-04-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16534.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~Aiphag0] for your work.

> Split datanode block pool locks to volume grain.
> 
>
> Key: HDFS-16534
> URL: https://issues.apache.org/jira/browse/HDFS-16534
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
>  This is a sub-task of HDFS-15382. 
> https://issues.apache.org/jira/browse/HDFS-15180 split the lock to block 
> pool granularity and did some preparation. This PR is the last part: the 
> volume-level lock.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15382) Split one FsDatasetImpl lock to volume grain locks.

2022-04-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15382:
---
Release Note: Throughput is one of the core performance metrics for a DataNode 
instance. However, it has not always reached the best performance, especially 
for Federation deployments, because of the global coarse-grain lock, despite 
various improvements. This series of issues (including HDFS-16534, HDFS-16511, 
HDFS-15382 and HDFS-16429) splits the global coarse-grain lock into fine-grain 
locks, a double-level lock for block pool and volume, to improve throughput and 
avoid lock contention between block pools and volumes.
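
As a rough illustration of the double-level lock described in this release note, 
the sketch below keeps one read-write lock per block pool plus one per 
(block pool, volume) pair. It is a simplified stand-in under assumed names 
(TwoLevelLocks, lockVolume, lockPool), not the actual DataNode lock manager 
implementation.

{code:java}
// Simplified two-level (block pool -> volume) lock sketch; illustrative only,
// not the real DataNode lock manager. Names here are assumptions.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class TwoLevelLocks {
  private final Map<String, ReentrantReadWriteLock> poolLocks = new ConcurrentHashMap<>();
  private final Map<String, ReentrantReadWriteLock> volumeLocks = new ConcurrentHashMap<>();

  // Per-replica operations: pool read lock + volume write lock, so heavy load
  // on one volume does not block other volumes in the same block pool.
  AutoCloseable lockVolume(String bpid, String volume) {
    ReentrantReadWriteLock pool =
        poolLocks.computeIfAbsent(bpid, k -> new ReentrantReadWriteLock());
    ReentrantReadWriteLock vol =
        volumeLocks.computeIfAbsent(bpid + "/" + volume, k -> new ReentrantReadWriteLock());
    pool.readLock().lock();
    vol.writeLock().lock();
    return () -> {                     // release in reverse acquisition order
      vol.writeLock().unlock();
      pool.readLock().unlock();
    };
  }

  // Pool-wide operations (e.g. removing a whole block pool) take the pool write lock.
  AutoCloseable lockPool(String bpid) {
    ReentrantReadWriteLock pool =
        poolLocks.computeIfAbsent(bpid, k -> new ReentrantReadWriteLock());
    pool.writeLock().lock();
    return () -> pool.writeLock().unlock();
  }
}
{code}

A caller would hold the returned handle in a try-with-resources block around 
per-replica or pool-wide work, mirroring the AutoCloseableLock pattern visible 
in the code snippets quoted elsewhere in this thread.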

> Split one FsDatasetImpl lock to volume grain locks.
> ---
>
> Key: HDFS-15382
> URL: https://issues.apache.org/jira/browse/HDFS-15382
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, 
> image-2020-06-03-1.png
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> In HDFS-15180 we split the lock to block pool granularity. But when one volume 
> is under heavy load, it blocks other requests in the same block pool but on a 
> different volume. So we split the lock into two levels to avoid this and to 
> improve DataNode performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15382) Split one FsDatasetImpl lock to volume grain locks.

2022-04-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1752#comment-1752
 ] 

Xiaoqiao He commented on HDFS-15382:


This big improvement is now completely ready. Thanks [~Aiphag0] for your great 
work, and thanks to everyone (too numerous to mention) for sharing good ideas 
and warm discussions. I have seen this improvement deployed at different 
corporations and the results are as expected. Any feedback is welcome if you 
meet any issues. Thanks all again.

> Split one FsDatasetImpl lock to volume grain locks.
> ---
>
> Key: HDFS-15382
> URL: https://issues.apache.org/jira/browse/HDFS-15382
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, 
> image-2020-06-03-1.png
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> In HDFS-15180 we split the lock to block pool granularity. But when one volume 
> is under heavy load, it blocks other requests in the same block pool but on a 
> different volume. So we split the lock into two levels to avoid this and to 
> improve DataNode performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16531) Avoid setReplication logging an edit record if old replication equals the new value

2022-04-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16531.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~sodonnell] for your contributions.

> Avoid setReplication logging an edit record if old replication equals the new 
> value
> ---
>
> Key: HDFS-16531
> URL: https://issues.apache.org/jira/browse/HDFS-16531
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I recently came across an NN log where about 800k setRep calls were made, 
> setting the replication from 3 to 3, i.e. leaving it unchanged.
> Even in a case like this, we log an edit record, an audit log, and perform 
> some quota checks etc.
> I believe it should be possible to avoid some of the work if we check for 
> oldRep == newRep and jump out of the method early.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16509) Fix decommission UnsupportedOperationException: Remove unsupported

2022-04-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16509.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~cndaimin] for your contributions.

> Fix decommission UnsupportedOperationException: Remove unsupported
> --
>
> Key: HDFS-16509
> URL: https://issues.apache.org/jira/browse/HDFS-16509
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.1, 3.3.2
>Reporter: daimin
>Assignee: daimin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> We encountered an "UnsupportedOperationException: Remove unsupported" error 
> when some datanodes were in decommission. The reason for the exception is that 
> datanode.getBlockIterator() returns an Iterator that does not support remove(); 
> however, DatanodeAdminDefaultMonitor#processBlocksInternal invokes it.remove() 
> when a block is not found, e.g., when the file containing the block is deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16569) Consider attaching block location info from client when closing a completed file

2022-05-06 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532747#comment-17532747
 ] 

Xiaoqiao He commented on HDFS-16569:


[~yuanbo] Thanks for your report. Do you mean that the last block size is 0, no 
more IBRs arrive, and the NameNode waits until the client times out?

> Consider attaching block location info from client when closing a completed 
> file
> 
>
> Key: HDFS-16569
> URL: https://issues.apache.org/jira/browse/HDFS-16569
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Yuanbo Liu
>Priority: Major
>
> When a file is finished, the client will not close it until the DNs send 
> RECEIVED_BLOCK via IBR or the client times out. We can always see this kind of 
> log in the namenode:
> {code:java}
> is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file{code}
> Since the client already has the last block locations, it's not necessary to 
> rely on the IBR from the DN when closing the file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-06 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532755#comment-17532755
 ] 

Xiaoqiao He commented on HDFS-16565:


[~jianghuazhu] Thanks for your report. Is it possible that an application opens 
a stream but never closes it? Did you try to dig into which process occupies the 
port on the source side?

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> We can see that the connections in the CLOSE_WAIT state have reached 
> 17k and are still growing. Viewing these CLOSE_WAITs through the lsof command 
> gives the following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the reason for this phenomenon is that Socket#close() is 
> not called correctly where the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16569) Consider attaching block location info from client when closing a completed file

2022-05-06 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532790#comment-17532790
 ] 

Xiaoqiao He commented on HDFS-16569:


Thanks [~yuanbo] for your detailed explanation. 
It is an interesting proposal. IIRC, there are some others DFS has implemented 
as your proposal, such as 
Tectonic(https://www.usenix.org/conference/fast21/presentation/pan). 
I believe it will improve performance for write flow. I am sure if there are 
some historical reasons or some security issues consideration when following 
above proposal.
{quote}You mean to say, since the client has already reported the block 
location, no need to wait for the datanode to report it again, right? So, the 
answer is No, that is per design, the namenode doesn't blindly trust a client, 
and there are other reasons as well. So, it requires at least one datanode to 
confirm that it has received the block with same amount of data as the client 
claims.{quote}
[~ayushtkn] Thanks Ayush to give one point that NameNode could not trust client 
hundred percent. +1 for me, this is basic of the original HDFS design. And 
another one, it could introduce security issue, such as one fake client report 
random information to NameNode could cause the following read failed and mount 
of missing blocks at NameNode side seen. I am concern if we could tradeoff 
initial assumptions and try to find the best solution. Anyway, it is valuable 
to have more discussion.
Thanks [~yuanbo] and [~ayushtkn].
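
For readers following the COMMITTED vs COMPLETE distinction in the quoted 
description below: the last block only becomes COMPLETE once at least the 
configured minimum replication number of DataNodes has confirmed the length the 
client claims. The sketch below is a simplified, hedged illustration of that 
rule, not the actual BlockManager code; the class and method names are 
assumptions.

{code:java}
// Simplified, illustrative check only -- not the actual BlockManager logic.
// A committed last block becomes COMPLETE once enough DataNodes confirm the
// same length the client claims; until then the NameNode keeps waiting for IBRs.
class CloseCheck {
  static boolean canComplete(long clientReportedLength,
                             long[] datanodeConfirmedLengths,
                             int minReplication) {
    int confirmed = 0;
    for (long len : datanodeConfirmedLengths) {
      if (len == clientReportedLength) {  // confirmation must match the claimed length
        confirmed++;
      }
    }
    return confirmed >= minReplication;
  }

  public static void main(String[] args) {
    // numNodes = 0 < minimum = 1, as in the quoted log: file cannot be completed yet.
    System.out.println(canComplete(1024L, new long[] {}, 1));        // false
    System.out.println(canComplete(1024L, new long[] {1024L}, 1));   // true
  }
}
{code}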

> Consider attaching block location info from client when closing a completed 
> file
> 
>
> Key: HDFS-16569
> URL: https://issues.apache.org/jira/browse/HDFS-16569
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Yuanbo Liu
>Priority: Major
>
> when a file is finished, client will not close it until DNs send 
> RECEIVED_BLOCK by ibr or client is timeout. we can always see such kind of 
> log in namenode
> {code:java}
> is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file{code}
> Since client already has the last block locations, it's not necessary to rely 
> on the ibr from DN when closing file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16569) Consider attaching block location info from client when closing a completed file

2022-05-06 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532790#comment-17532790
 ] 

Xiaoqiao He edited comment on HDFS-16569 at 5/6/22 11:22 AM:
-

Thanks [~yuanbo] for your detailed explanation. 
It is an interesting proposal. IIRC, some other distributed file systems have 
implemented something like your proposal, such as 
Tectonic (https://www.usenix.org/conference/fast21/presentation/pan). 
I believe it will improve performance for the write flow. I am not sure if there 
are some historical reasons or security considerations behind not following the 
above proposal.
{quote}You mean to say, since the client has already reported the block 
location, no need to wait for the datanode to report it again, right? So, the 
answer is No, that is per design, the namenode doesn't blindly trust a client, 
and there are other reasons as well. So, it requires at least one datanode to 
confirm that it has received the block with same amount of data as the client 
claims.{quote}
[~ayushtkn] Thanks Ayush for the point that the NameNode cannot trust the client 
one hundred percent. +1 from me, this is basic to the original HDFS design. And 
another one: it could introduce a security issue, for example a fake client 
reporting random information to the NameNode could cause subsequent reads to 
fail and missing blocks to show up at the NameNode side. I am concerned about 
whether we can trade off the initial assumptions and try to find the best 
solution. Anyway, it is valuable to have more discussion.
Thanks [~yuanbo] and [~ayushtkn].


was (Author: hexiaoqiao):
Thanks [~yuanbo] for your detailed explanation. 
It is an interesting proposal. IIRC, there are some others DFS has implemented 
as your proposal, such as 
Tectonic(https://www.usenix.org/conference/fast21/presentation/pan). 
I believe it will improve performance for write flow. I am sure if there are 
some historical reasons or some security issues consideration when following 
above proposal.
{quote}You mean to say, since the client has already reported the block 
location, no need to wait for the datanode to report it again, right? So, the 
answer is No, that is per design, the namenode doesn't blindly trust a client, 
and there are other reasons as well. So, it requires at least one datanode to 
confirm that it has received the block with same amount of data as the client 
claims.{quote}
[~ayushtkn] Thanks Ayush to give one point that NameNode could not trust client 
hundred percent. +1 for me, this is basic of the original HDFS design. And 
another one, it could introduce security issue, such as one fake client report 
random information to NameNode could cause the following read failed and mount 
of missing blocks at NameNode side seen. I am concern if we could tradeoff 
initial assumptions and try to find the best solution. Anyway, it is valuable 
to have more discussion.
Thanks [~yuanbo] and [~ayushtkn].

> Consider attaching block location info from client when closing a completed 
> file
> 
>
> Key: HDFS-16569
> URL: https://issues.apache.org/jira/browse/HDFS-16569
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Yuanbo Liu
>Priority: Major
>
> When a file is finished, the client will not close it until the DNs send 
> RECEIVED_BLOCK via IBR or the client times out. We can always see this kind of 
> log in the namenode:
> {code:java}
> is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file{code}
> Since the client already has the last block locations, it's not necessary to 
> rely on the IBR from the DN when closing the file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16553) Fix checkstyle for the length of BlockManager construction method over limit.

2022-04-29 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16553:
---
Component/s: namenode

> Fix checkstyle for the length of BlockManager construction method over limit.
> -
>
> Key: HDFS-16553
> URL: https://issues.apache.org/jira/browse/HDFS-16553
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chengwei Wang
>Assignee: Chengwei Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The length of the BlockManager constructor is 156 lines, which is over the 
> 150-line limit, so refactor the method to fix the checkstyle warning.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16553) Fix checkstyle for the length of BlockManager construction method over limit.

2022-04-29 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16553.

   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s:   (was: 3.4.0)
  Resolution: Fixed

Committed to trunk. Thanks [~smarthan] for your contributions.

> Fix checkstyle for the length of BlockManager construction method over limit.
> -
>
> Key: HDFS-16553
> URL: https://issues.apache.org/jira/browse/HDFS-16553
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chengwei Wang
>Assignee: Chengwei Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The length of the BlockManager constructor is 156 lines, which is over the 
> 150-line limit, so refactor the method to fix the checkstyle warning.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16569) Consider attaching block location info from client when closing a completed file

2022-05-10 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534276#comment-17534276
 ] 

Xiaoqiao He commented on HDFS-16569:


{quote}Regarding Tectonic, seperating deamon threads from NN and isolating DN 
from NN are most attracting features. Maybe we can design and implement those 
features on hadoop-4 ?{quote}
Hmmm, it is very similar to Ozone and would be a big architecture refactor.
Maybe we could start with only one small feature (of course it needs to be 
discussed and agreed on first), such as reporting locations from the client 
only, as mentioned above.

> Consider attaching block location info from client when closing a completed 
> file
> 
>
> Key: HDFS-16569
> URL: https://issues.apache.org/jira/browse/HDFS-16569
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Yuanbo Liu
>Priority: Major
>
> When a file is finished, the client will not close it until the DNs send 
> RECEIVED_BLOCK via IBR or the client times out. We can always see this kind of 
> log in the namenode:
> {code:java}
> is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file{code}
> Since the client already has the last block locations, it's not necessary to 
> rely on the IBR from the DN when closing the file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16717) Replace NPE with IOException in DataNode.class

2022-08-23 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16717.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander]!

> Replace NPE with IOException in DataNode.class
> --
>
> Key: HDFS-16717
> URL: https://issues.apache.org/jira/browse/HDFS-16717
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the current logic, if the storage is not yet initialized, an NPE is thrown 
> in DataNode.class. Developers and SREs are very sensitive to NPEs, so I feel 
> that we can use IOException instead of NPE when the storage is not yet 
> initialized.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16735) Reduce the number of HeartbeatManager loops

2022-08-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16735.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~zhangshuyan] for your contributions!

> Reduce the number of HeartbeatManager loops
> ---
>
> Key: HDFS-16735
> URL: https://issues.apache.org/jira/browse/HDFS-16735
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> HeartbeatManager only processes one dead datanode (and failed storage) per 
> round in heartbeatCheck(), that is to say, if there are ten failed storages, 
> all datanode states need to be scanned 10 times, which is unnecessary and a 
> waste of resources. This patch makes the number of bad storages processed per 
> scan configurable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10453) ReplicationMonitor thread could stuck for long time due to the race between replication and delete of same file in a large cluster.

2022-10-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624847#comment-17624847
 ] 

Xiaoqiao He commented on HDFS-10453:


[~yuyanlei] It works fine in my internal cluster. It has been checked in to 
trunk and other active branches. Thanks.

> ReplicationMonitor thread could get stuck for a long time due to the race 
> between replication and delete of the same file in a large cluster.
> ---
>
> Key: HDFS-10453
> URL: https://issues.apache.org/jira/browse/HDFS-10453
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.1, 2.5.2, 2.7.1, 2.6.4
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 2.7.6, 3.0.3
>
> Attachments: HDFS-10453-branch-2.001.patch, 
> HDFS-10453-branch-2.003.patch, HDFS-10453-branch-2.7.004.patch, 
> HDFS-10453-branch-2.7.005.patch, HDFS-10453-branch-2.7.006.patch, 
> HDFS-10453-branch-2.7.007.patch, HDFS-10453-branch-2.7.008.patch, 
> HDFS-10453-branch-2.7.009.patch, HDFS-10453-branch-2.8.001.patch, 
> HDFS-10453-branch-2.8.002.patch, HDFS-10453-branch-2.9.001.patch, 
> HDFS-10453-branch-2.9.002.patch, HDFS-10453-branch-3.0.001.patch, 
> HDFS-10453-branch-3.0.002.patch, HDFS-10453-trunk.001.patch, 
> HDFS-10453-trunk.002.patch, HDFS-10453.001.patch
>
>
> The ReplicationMonitor thread can get stuck for a long time and, with low 
> probability, lose data. Consider the typical scenario:
> (1) create and close a file with the default replication (3);
> (2) increase the replication of the file (to 10);
> (3) delete the file while ReplicationMonitor is scheduling blocks belonging 
> to that file for replication.
> When the ReplicationMonitor gets stuck, the NameNode prints logs like:
> {code:xml}
> 2016-04-19 10:20:48,083 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) For more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> ..
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) For more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough 
> replicas: expected size is 7 but only 0 storage types can be selected 
> (replication=10, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK, 
> DISK, DISK, DISK, DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) All required storage types are unavailable:  
> unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> {code}
> This happens because two threads (#NameNodeRpcServer and 
> #ReplicationMonitor) process the same block at the same moment:
> (1) ReplicationMonitor#computeReplicationWorkForBlocks gets blocks to 
> replicate and releases the global lock.
> (2) FSNamesystem#delete is invoked to delete the blocks and clears the 
> references in the blocks map, neededReplications, etc. The block's numBytes 
> is set to NO_ACK (Long.MAX_VALUE), which indicates that the block deletion 
> does not need an explicit ACK from the node.
> (3) ReplicationMonitor#computeReplicationWorkForBlocks continues to 
> chooseTargets for the same blocks, and no node is selected after traversing 
> the whole cluster, because no candidate satisfies the goodness criteria 
> (remaining space must reach the required size Long.MAX_VALUE).
> During stage (3) the ReplicationMonitor is stuck for a long time, especially 
> in a large cluster. invalidateBlocks and neededReplications keep growing and 
> are not consumed; at worst, data is lost.
> This can mostly be avoided by skipping chooseTarget for
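>
> For illustration, a hedged sketch of that skip, using simplified stand-in 
> types (in the real code the deleted block is marked by setting its numBytes 
> to BlockCommand.NO_ACK, i.e. Long.MAX_VALUE):
> {code:java}
> public class ReplicationSkipSketch {
>   static final long NO_ACK = Long.MAX_VALUE; // mirrors BlockCommand.NO_ACK
>
>   /** Minimal stand-in for a block queued for replication. */
>   static class PendingBlock {
>     final long numBytes;
>     PendingBlock(long numBytes) { this.numBytes = numBytes; }
>   }
>
>   /**
>    * True when the block was deleted concurrently (numBytes set to NO_ACK)
>    * and should be dropped from neededReplications instead of being handed
>    * to chooseTarget, which can never satisfy a required size of
>    * Long.MAX_VALUE.
>    */
>   static boolean shouldSkipReplication(PendingBlock block) {
>     return block.numBytes == NO_ACK;
>   }
> }
> {code}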

[jira] [Updated] (HDFS-16781) Can the setReplication method not set the replication of a whole directory (whole folder)?

2022-09-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16781:
---
Priority: Minor  (was: Blocker)

> Can the setReplication method not set the replication of a whole directory (whole folder)?
> --
>
> Key: HDFS-16781
> URL: https://issues.apache.org/jira/browse/HDFS-16781
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfs
>Affects Versions: 2.6.0
> Environment: org.apache.hadoop.fs.FileSystem
>Reporter: zdl
>Priority: Minor
>
> org.apache.hadoop.fs.FileSystem has a setReplication method that can set the 
> replication factor of a target file, but it only applies to individual 
> files; if a directory is given it has no effect.
> When I specify a directory, I would like all files under it, including files 
> in subdirectories, to have their replication set. In fact, the command line 
> hadoop fs -setrep /dirs achieves exactly that, so why is it not possible 
> through this Java API?
> Or has this problem already been solved in some newer version? Please 
> advise. Thanks.
> I want to set the replication of all files in a whole folder via the 
> org.apache.hadoop.fs.FileSystem setReplication function, while in v2.6.0 it 
> can only be applied to a single file.
> Are there any solutions, or a later version that can solve it?
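>
> As a client-side workaround sketch (not a change to the API itself), the 
> public FileSystem API can walk the directory and apply setReplication per 
> file; the path and replication factor below are placeholders:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.LocatedFileStatus;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.RemoteIterator;
>
> public class SetRepRecursively {
>   public static void main(String[] args) throws IOException {
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     Path dir = new Path("/dirs");   // placeholder directory
>     short replication = 2;          // placeholder replication factor
>
>     // setReplication applies only to files, so recurse over the directory,
>     // which is essentially what the hadoop fs -setrep shell command does.
>     RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
>     while (it.hasNext()) {
>       LocatedFileStatus status = it.next();
>       if (status.isFile()) {
>         fs.setReplication(status.getPath(), replication);
>       }
>     }
>   }
> }
> {code}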



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


