[jira] [Resolved] (HDFS-16194) Simplify the code with DatanodeID#getXferAddrWithHostname

2021-09-04 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16194.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~tomscut] for your contributions!

> Simplify the code with DatanodeID#getXferAddrWithHostname
> ----------------------------------------------------------
>
> Key: HDFS-16194
> URL: https://issues.apache.org/jira/browse/HDFS-16194
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Simplify the code with DatanodeID#getXferAddrWithHostname.
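> A minimal before/after sketch (hypothetical call site; getHostName,
> getXferPort and getXferAddrWithHostname are existing DatanodeID accessors):
> {code:java}
> // Before: building the transfer address by hand.
> String before = dnId.getHostName() + ":" + dnId.getXferPort();
> // After: one call that returns the same "hostname:port" string.
> String after = dnId.getXferAddrWithHostname();
> {code}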






[jira] [Updated] (HDFS-16138) BlockReportProcessingThread exit doesn't print the actual stack

2021-09-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16138:
---
Summary: BlockReportProcessingThread exit doesn't print the actual stack  
(was: BlockReportProcessingThread exit doesnt print the acutal stack)

> BlockReportProcessingThread exit doesn't print the actual stack
> ---
>
> Key: HDFS-16138
> URL: https://issues.apache.org/jira/browse/HDFS-16138
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The BlockReportProcessingThread may exit for multiple reasons, but the
> current logging prints only the exception message with a different stack,
> which makes the issue difficult to debug.
>  
> Existing logging:
> 2021-07-20 10:20:23,104 [Block report processor] INFO  util.ExitUtil 
> (ExitUtil.java:terminate(210)) - Exiting with status 1: Block report 
> processor encountered fatal exception: java.lang.AssertionError
> 2021-07-20 10:20:23,104 [Block report processor] ERROR util.ExitUtil 
> (ExitUtil.java:terminate(213)) - Terminate called
> 1: Block report processor encountered fatal exception: 
> java.lang.AssertionError
>     at 
> org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:304)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5315)
> Exception in thread "Block report processor" 1: Block report processor 
> encountered fatal exception: java.lang.AssertionError
>     at 
> org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:304)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5315)
>  
> Actual issue found at:
> 2021-07-20 10:20:23,101 [Block report processor] ERROR 
> blockmanagement.BlockManager (BlockManager.java:run(5314)) - 
> java.lang.AssertionError
> java.lang.AssertionError
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addStoredBlock(BlockManager.java:3480)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processAndHandleReportedBlock(BlockManager.java:4280)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4202)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processIncrementalBlockReport(BlockManager.java:4338)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processIncrementalBlockReport(BlockManager.java:4305)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processIncrementalBlockReport(FSNamesystem.java:4853)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer$2.run(NameNodeRpcServer.java:1657)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.processQueue(BlockManager.java:5334)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5312)
>  
> This issue was found while working on the FGL branch, but the same issue can
> happen on trunk in any error scenario.
>  
> [~hemanthboyina] [~hexiaoqiao]
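> A minimal sketch of the direction of the fix (hypothetical; the committed
> patch may differ): log the original Throwable with its own stack trace before
> handing control to ExitUtil.
> {code:java}
> @Override
> public void run() {
>   try {
>     processQueue();
>   } catch (Throwable t) {
>     // Log the original stack explicitly; ExitUtil.terminate only reports the
>     // message and its own call site, which is what made this hard to debug.
>     LOG.error("{} encountered fatal exception and exit.", getName(), t);
>     ExitUtil.terminate(1, getName() + " encountered fatal exception: " + t);
>   }
> }
> {code}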






[jira] [Updated] (HDFS-16138) BlockReportProcessingThread exit doesn't print the actual stack

2021-09-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16138:
---
Issue Type: Improvement  (was: Bug)

> BlockReportProcessingThread exit doesn't print the actual stack
> ---
>
> Key: HDFS-16138
> URL: https://issues.apache.org/jira/browse/HDFS-16138
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The BlockReportProcessingThread may exit for multiple reasons, but the
> current logging prints only the exception message with a different stack,
> which makes the issue difficult to debug.
>  
> Existing logging:
> 2021-07-20 10:20:23,104 [Block report processor] INFO  util.ExitUtil 
> (ExitUtil.java:terminate(210)) - Exiting with status 1: Block report 
> processor encountered fatal exception: java.lang.AssertionError
> 2021-07-20 10:20:23,104 [Block report processor] ERROR util.ExitUtil 
> (ExitUtil.java:terminate(213)) - Terminate called
> 1: Block report processor encountered fatal exception: 
> java.lang.AssertionError
>     at 
> org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:304)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5315)
> Exception in thread "Block report processor" 1: Block report processor 
> encountered fatal exception: java.lang.AssertionError
>     at 
> org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:304)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5315)
>  
> Actual issue found at:
> 2021-07-20 10:20:23,101 [Block report processor] ERROR 
> blockmanagement.BlockManager (BlockManager.java:run(5314)) - 
> java.lang.AssertionError
> java.lang.AssertionError
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addStoredBlock(BlockManager.java:3480)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processAndHandleReportedBlock(BlockManager.java:4280)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4202)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processIncrementalBlockReport(BlockManager.java:4338)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processIncrementalBlockReport(BlockManager.java:4305)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processIncrementalBlockReport(FSNamesystem.java:4853)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer$2.run(NameNodeRpcServer.java:1657)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.processQueue(BlockManager.java:5334)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5312)
>  
> This issue was found while working on the FGL branch, but the same issue can
> happen on trunk in any error scenario.
>  
> [~hemanthboyina] [~hexiaoqiao]






[jira] [Resolved] (HDFS-16138) BlockReportProcessingThread exit doesn't print the actual stack

2021-09-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16138.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Thanks [~prasad-acit] for your work. Committed to trunk.

> BlockReportProcessingThread exit doesn't print the actual stack
> ---
>
> Key: HDFS-16138
> URL: https://issues.apache.org/jira/browse/HDFS-16138
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The BlockReportProcessingThread may exit for multiple reasons, but the
> current logging prints only the exception message with a different stack,
> which makes the issue difficult to debug.
>  
> Existing logging:
> 2021-07-20 10:20:23,104 [Block report processor] INFO  util.ExitUtil 
> (ExitUtil.java:terminate(210)) - Exiting with status 1: Block report 
> processor encountered fatal exception: java.lang.AssertionError
> 2021-07-20 10:20:23,104 [Block report processor] ERROR util.ExitUtil 
> (ExitUtil.java:terminate(213)) - Terminate called
> 1: Block report processor encountered fatal exception: 
> java.lang.AssertionError
>     at 
> org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:304)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5315)
> Exception in thread "Block report processor" 1: Block report processor 
> encountered fatal exception: java.lang.AssertionError
>     at 
> org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:304)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5315)
>  
> Actual issue found at:
> 2021-07-20 10:20:23,101 [Block report processor] ERROR 
> blockmanagement.BlockManager (BlockManager.java:run(5314)) - 
> java.lang.AssertionError
> java.lang.AssertionError
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addStoredBlock(BlockManager.java:3480)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processAndHandleReportedBlock(BlockManager.java:4280)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4202)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processIncrementalBlockReport(BlockManager.java:4338)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processIncrementalBlockReport(BlockManager.java:4305)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processIncrementalBlockReport(FSNamesystem.java:4853)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer$2.run(NameNodeRpcServer.java:1657)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.processQueue(BlockManager.java:5334)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5312)
>  
> This issue was found while working on the FGL branch, but the same issue can
> happen on trunk in any error scenario.
>  
> [~hemanthboyina] [~hexiaoqiao]






[jira] [Updated] (HDFS-16044) Fix getListing call getLocatedBlocks even source is a directory

2021-09-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16044:
---
   Attachment: HDFS-16044.05.patch
 Hadoop Flags:   (was: Reviewed)
Affects Version/s: (was: 3.1.1)
   Status: Patch Available  (was: Reopened)

Re-submit patch [^HDFS-16044.05.patch] and try to trigger Yetus again.

> Fix getListing call getLocatedBlocks even source is a directory
> ---
>
> Key: HDFS-16044
> URL: https://issues.apache.org/jira/browse/HDFS-16044
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: ludun
>Assignee: ludun
>Priority: Major
> Attachments: HDFS-16044.00.patch, HDFS-16044.01.patch, 
> HDFS-16044.02.patch, HDFS-16044.03.patch, HDFS-16044.05.patch
>
>
> In our production cluster getListing is called very frequently and the
> processing time of the RPC request is very high, so we tried to optimize the
> performance of the getListing request.
> After some checking, we found that even when the source and its children are
> directories, the getListing request still calls getLocatedBlocks.
> An example request, with needLocation false:
> {code:java}
> 2021-05-27 15:16:07,093 TRACE ipc.ProtobufRpcEngine: 1: Call -> 
> 8-5-231-4/8.5.231.4:25000: getListing {src: 
> "/data/connector/test/topics/102test" startAfter: "" needLocation: false}
> {code}
> but the getListing request calls getLocatedBlocks 1000 times, which is not
> needed.
> {code:java}
> `---ts=2021-05-27 14:19:15;thread_name=IPC Server handler 86 on 
> 25000;id=e6;is_daemon=true;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@5fcfe4b2
> `---[35.068532ms] 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp:getListing()
> +---[0.003542ms] 
> org.apache.hadoop.hdfs.server.namenode.INodesInPath:getPathComponents() #214
> +---[0.003053ms] 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory:isExactReservedName() #95
> +---[0.002938ms] 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory:readLock() #218
> +---[0.00252ms] 
> org.apache.hadoop.hdfs.server.namenode.INodesInPath:isDotSnapshotDir() #220
> +---[0.002788ms] 
> org.apache.hadoop.hdfs.server.namenode.INodesInPath:getPathSnapshotId() #223
> +---[0.002905ms] 
> org.apache.hadoop.hdfs.server.namenode.INodesInPath:getLastINode() #224
> +---[0.002785ms] 
> org.apache.hadoop.hdfs.server.namenode.INode:getStoragePolicyID() #230
> +---[0.002236ms] 
> org.apache.hadoop.hdfs.server.namenode.INode:isDirectory() #233
> +---[0.002919ms] 
> org.apache.hadoop.hdfs.server.namenode.INode:asDirectory() #242
> +---[0.003408ms] 
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory:getChildrenList() #243
> +---[0.005942ms] 
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory:nextChild() #244
> +---[0.002467ms] org.apache.hadoop.hdfs.util.ReadOnlyList:size() #245
> +---[0.005481ms] 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory:getLsLimit() #247
> +---[0.002176ms] 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory:getLsLimit() #248
> +---[min=0.00211ms,max=0.005157ms,total=2.247572ms,count=1000] 
> org.apache.hadoop.hdfs.util.ReadOnlyList:get() #252
> +---[min=0.001946ms,max=0.005411ms,total=2.041715ms,count=1000] 
> org.apache.hadoop.hdfs.server.namenode.INode:isSymlink() #253
> +---[min=0.002176ms,max=0.005426ms,total=2.264472ms,count=1000] 
> org.apache.hadoop.hdfs.server.namenode.INode:getLocalStoragePolicyID() #254
> +---[min=0.002251ms,max=0.006849ms,total=2.351935ms,count=1000] 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp:getStoragePolicyID()
>  #95
> +---[min=0.006091ms,max=0.012333ms,total=6.439434ms,count=1000] 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp:createFileStatus()
>  #257
> +---[min=0.00269ms,max=0.004995ms,total=2.788194ms,count=1000] 
> org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus:getLocatedBlocks() #265
> +---[0.003234ms] 
> org.apache.hadoop.hdfs.protocol.DirectoryListing:<init>() #274
> `---[0.002457ms] 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory:readUnlock() #277
> {code}
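> A hypothetical helper illustrating the fix direction (names are assumptions,
> not the committed patch): only compute block locations for regular files.
> {code:java}
> // Directories have no blocks; calling getLocatedBlocks for them is pure
> // overhead, which is what the trace above shows 1000 times per listing.
> private static boolean shouldLoadLocations(INode node, boolean needLocation) {
>   return needLocation && node.isFile();
> }
> {code}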






[jira] [Commented] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access

2021-09-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410147#comment-17410147
 ] 

Xiaoqiao He commented on HDFS-16196:


[~lei w], Thanks for your report. Which version do you deploy? I just checked
the logic on trunk, and I don't see a difference between the client invoking
directly and the router proxy invoking. Would you mind offering more
information about this case?

> Namesystem#completeFile method will log incorrect path information when 
> router to access
> --------------------------------------------------------------------------
>
> Key: HDFS-16196
> URL: https://issues.apache.org/jira/browse/HDFS-16196
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lei w
>Priority: Minor
>
> The Router does not send the entire path information to the NameNode, because
> the ClientProtocol#complete method identifies the file by its fileId
> parameter. The NameNode then logs incorrect path information. This is very
> confusing; should we let the router pass the path information, or modify the
> logged path on the NameNode?
> The completeFile log is as follows:
> StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_*






[jira] [Commented] (HDFS-16200) Improve NameNode failover

2021-09-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410148#comment-17410148
 ] 

Xiaoqiao He commented on HDFS-16200:


Thanks [~aihuaxu] for your report. I think the best way is to improve the
resolution and rack-awareness performance rather than disabling it directly.
FYI.

> Improve NameNode failover
> -
>
> Key: HDFS-16200
> URL: https://issues.apache.org/jira/browse/HDFS-16200
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: namenode
>Affects Versions: 2.8.2
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In a busy cluster, we are noticing the NameNode failover takes longer time 
> (over 10 minutes) and it causes cluster down time during the time period.
> One bottleneck lies in resolving the client hosts' topology when the cluster
> is not colocated with the computing hosts. The NameNode resolves a client
> host's topology and uses it to sort the hosts where the blocks are located.
> The topology is cached so that later accesses are efficient, but if the
> standby NameNode has just restarted, all the client hosts, e.g. YARN hosts,
> need to be resolved again.
> Solutions can be: 1) expose an API in DFSAdmin to preload the topology cache,
> or 2) add a new configuration to the HDFS cluster to skip resolving topology
> for a non-colocated HDFS cluster. Since client hosts and HDFS hosts are not
> colocated, it is unnecessary to sort the DataNodes for the clients.
>  
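> A hypothetical warm-up sketch for option 1) (DNSToSwitchMapping#resolve and
> CachedDNSToSwitchMapping are existing Hadoop APIs; the hosts file is an
> assumed input):
> {code:java}
> // CachedDNSToSwitchMapping caches resolution results, so resolving the known
> // client hosts once at standby startup avoids paying the cost at failover.
> List<String> clientHosts =
>     Files.readAllLines(Paths.get("/etc/hadoop/client-hosts"));
> dnsToSwitchMapping.resolve(clientHosts);
> {code}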






[jira] [Commented] (HDFS-16196) Namesystem#completeFile method will log incorrect path information when router to access

2021-09-06 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410484#comment-17410484
 ] 

Xiaoqiao He commented on HDFS-16196:


Thanks [~lei w] for your report. It is a confusing log for users, and I think
it should be fixed, whether on the NameNode or the Router side. After
checking, I think it is caused by the following `method` construction, which
passes `new RemoteParam()` without `src`.
{code:java}
  @Override
  public boolean complete(String src, String clientName, ExtendedBlock last,
  long fileId) throws IOException {
rpcServer.checkOperation(NameNode.OperationCategory.WRITE);

RemoteMethod method = new RemoteMethod("complete",
new Class<?>[] {String.class, String.class, ExtendedBlock.class,
long.class},
new RemoteParam(), clientName, last, fileId);

if (last != null) {
  return rpcClient.invokeSingle(last, method, Boolean.class);
}

final List<RemoteLocation> locations =
rpcServer.getLocationsForPath(src, true);
// Complete can return true/false, so don't expect a result
return rpcClient.invokeSequential(locations, method, Boolean.class, null);
  }
{code}

Maybe this is a common case, and some other requests (which use
`rpcClient.invokeSingle`) could have the same issue. Would you like to improve
them together?
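
A hypothetical sketch of one fix direction (not the committed patch; it
assumes RouterRpcClient#invokeSingle has an overload taking a
RemoteLocationContext): resolve `src` first, so the RemoteParam substitutes a
concrete destination path that the NameNode can log.
{code:java}
final List<RemoteLocation> locations =
    rpcServer.getLocationsForPath(src, true);
if (last != null) {
  // Route by the block as before, but against a resolved location so the
  // "complete" call carries the real path instead of a default one.
  return rpcClient.invokeSingle(locations.get(0), method, Boolean.class);
}
return rpcClient.invokeSequential(locations, method, Boolean.class, null);
{code}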

> Namesystem#completeFile method will log incorrect path information when 
> router to access
> --------------------------------------------------------------------------
>
> Key: HDFS-16196
> URL: https://issues.apache.org/jira/browse/HDFS-16196
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lei w
>Priority: Minor
>
> The Router does not send the entire path information to the NameNode, because
> the ClientProtocol#complete method identifies the file by its fileId
> parameter. The NameNode then logs incorrect path information. This is very
> confusing; should we let the router pass the path information, or modify the
> logged path on the NameNode?
> The completeFile log is as follows:
> StateChange: DIR* completeFile: / is closed by DFSClient_NONMAPREDUC_*






[jira] [Resolved] (HDFS-15160) ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl methods should use datanode readlock

2021-09-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-15160.

Fix Version/s: 3.2.3
   Resolution: Fixed

> ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl 
> methods should use datanode readlock
> ---
>
> Key: HDFS-15160
> URL: https://issues.apache.org/jira/browse/HDFS-15160
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.1
>
> Attachments: HDFS-15160-branch-3.3-001.patch, HDFS-15160.001.patch, 
> HDFS-15160.002.patch, HDFS-15160.003.patch, HDFS-15160.004.patch, 
> HDFS-15160.005.patch, HDFS-15160.006.patch, HDFS-15160.007.patch, 
> HDFS-15160.008.patch, HDFS-15160.branch-3-3.001.patch, 
> image-2020-04-10-17-18-08-128.png, image-2020-04-10-17-18-55-938.png
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Now we have HDFS-15150, we can start to move some DN operations to use the
> read lock rather than the write lock to improve concurrency. The first step
> is to make the changes to ReplicaMap, as many other methods make calls to it.
> This Jira switches read operations against the volume map to use the readLock 
> rather than the write lock.
> Additionally, some methods make a call to replicaMap.replicas() (eg 
> getBlockReports, getFinalizedBlocks, deepCopyReplica) and only use the result 
> in a read only fashion, so they can also be switched to using a readLock.
> Next is the directory scanner and disk balancer, which only require a read 
> lock.
> Finally (for this Jira) are various "low hanging fruit" items in BlockSender
> and FsDatasetImpl where it is fairly obvious they only need a read lock.
> For now, I have avoided changing anything which looks too risky, as I think
> it's better to do any larger refactoring or risky changes each in their own
> Jira.
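> A minimal sketch of the read-path change (hypothetical field names;
> AutoCloseableLock is the existing Hadoop utility):
> {code:java}
> // A read-only view of the volume map only needs the read lock: the critical
> // section reads replicas but never mutates the map.
> try (AutoCloseableLock l = datasetReadLock.acquire()) {
>   return new HashSet<>(volumeMap.replicas(bpid));
> }
> {code}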






[jira] [Updated] (HDFS-15457) TestFsDatasetImpl fails intermittently

2021-09-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15457:
---
Fix Version/s: 3.2.3

> TestFsDatasetImpl fails intermittently
> --
>
> Key: HDFS-15457
> URL: https://issues.apache.org/jira/browse/HDFS-15457
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {{TestFsDatasetImpl}} fails intermittently on hadoop trunk.
> {code:bash}
> [ERROR] Tests run: 21, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 30.104 s <<< FAILURE! - in 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl
> [ERROR] 
> testReadLockEnabledByDefault(org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl)
>   Time elapsed: 0.128 s  <<< FAILURE!
> java.lang.AssertionError: expected:<true> but was:<false>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl.testReadLockEnabledByDefault(TestFsDatasetImpl.java:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}






[jira] [Updated] (HDFS-15150) Introduce read write lock to Datanode

2021-09-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15150:
---
Fix Version/s: 3.2.3

> Introduce read write lock to Datanode
> -
>
> Key: HDFS-15150
> URL: https://issues.apache.org/jira/browse/HDFS-15150
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.3.0, 2.10.2, 3.2.3
>
> Attachments: HDFS-15150-branch-2.10.001.patch, 
> HDFS-15150-branch-2.10.002.patch, HDFS-15150-branch-2.10.003.patch, 
> HDFS-15150.001.patch, HDFS-15150.002.patch, HDFS-15150.003.patch
>
>
> HDFS-9668 pointed out the issues around the DN lock being a point of 
> contention some time ago, but that Jira went in a direction of creating a new 
> FSDataset implementation which is very risky, and activity on the Jira has 
> stalled for a few years now. Edit: Looks like HDFS-9668 eventually went in a 
> similar direction to what I was thinking, so I will review that Jira in more 
> detail to see if this one is necessary.
> I feel there could be significant gains by moving to a ReentrantReadWrite 
> lock within the DN. The current implementation is simply a ReentrantLock so 
> any locker blocks all others.
> One place I think a read lock would benefit us significantly is when the DN
> is serving a lot of small blocks and there are jobs which perform a lot of
> reads. The start of reading any block right now takes the lock, but if we
> moved this to a read lock, many reads could do this at the same time.
> The first conservative step would be to change the current lock and then
> make all accesses to it obtain the write lock. That way, we should keep the
> current behaviour and then we can selectively move some lock accesses to the
> read lock in separate Jiras.
> I would appreciate any thoughts on this, and also if anyone has attempted it 
> before and found any blockers.
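> A minimal sketch of that conservative first step (hypothetical field names):
> {code:java}
> // Swap the plain ReentrantLock for a fair ReentrantReadWriteLock, but hand
> // out only the write lock at first: locking behaviour stays identical until
> // individual call sites are proven safe to downgrade to the read lock.
> private final ReentrantReadWriteLock datasetRWLock =
>     new ReentrantReadWriteLock(true);
>
> public Lock getDatasetLock() {
>   return datasetRWLock.writeLock(); // read lock handed out in later Jiras
> }
> {code}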






[jira] [Updated] (HDFS-15818) Fix TestFsDatasetImpl.testReadLockCanBeDisabledByConfig

2021-09-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15818:
---
Fix Version/s: 3.2.3

> Fix TestFsDatasetImpl.testReadLockCanBeDisabledByConfig
> ---
>
> Key: HDFS-15818
> URL: https://issues.apache.org/jira/browse/HDFS-15818
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Current TestFsDatasetImpl.testReadLockCanBeDisabledByConfig is incorrect:
> 1) The test fails intermittently, as the holder thread can acquire the lock
> first
> [https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2666/1/testReport/]
>  
> 2) Test passes regardless of the setting of 
> DFS_DATANODE_LOCK_READ_WRITE_ENABLED_KEY






[jira] [Resolved] (HDFS-16223) AvailableSpaceRackFaultTolerantBlockPlacementPolicy should use chooseRandomWithStorageTypeTwoTrial() for better performance.

2021-09-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16223.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~ayushtkn] for your work!

> AvailableSpaceRackFaultTolerantBlockPlacementPolicy should use 
> chooseRandomWithStorageTypeTwoTrial() for better performance.
> --------------------------------------------------------------------------
>
> Key: HDFS-16223
> URL: https://issues.apache.org/jira/browse/HDFS-16223
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Use chooseRandomWithStorageTypeTwoTrial as AvailableSpaceBlockPlacementPolicy 
> does.






[jira] [Updated] (HDFS-15457) TestFsDatasetImpl fails intermittently

2021-09-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15457:
---
Fix Version/s: 3.2.4

> TestFsDatasetImpl fails intermittently
> --
>
> Key: HDFS-15457
> URL: https://issues.apache.org/jira/browse/HDFS-15457
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.2.4
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {{TestFsDatasetImpl}} fails intermittently on hadoop trunk.
> {code:bash}
> [ERROR] Tests run: 21, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 30.104 s <<< FAILURE! - in 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl
> [ERROR] 
> testReadLockEnabledByDefault(org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl)
>   Time elapsed: 0.128 s  <<< FAILURE!
> java.lang.AssertionError: expected:<true> but was:<false>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl.testReadLockEnabledByDefault(TestFsDatasetImpl.java:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}






[jira] [Updated] (HDFS-15160) ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl methods should use datanode readlock

2021-09-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15160:
---
Fix Version/s: 3.2.3

> ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl 
> methods should use datanode readlock
> ---
>
> Key: HDFS-15160
> URL: https://issues.apache.org/jira/browse/HDFS-15160
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3, 3.2.4
>
> Attachments: HDFS-15160-branch-3.3-001.patch, HDFS-15160.001.patch, 
> HDFS-15160.002.patch, HDFS-15160.003.patch, HDFS-15160.004.patch, 
> HDFS-15160.005.patch, HDFS-15160.006.patch, HDFS-15160.007.patch, 
> HDFS-15160.008.patch, HDFS-15160.branch-3-3.001.patch, 
> image-2020-04-10-17-18-08-128.png, image-2020-04-10-17-18-55-938.png
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Now we have HDFS-15150, we can start to move some DN operations to use the
> read lock rather than the write lock to improve concurrency. The first step
> is to make the changes to ReplicaMap, as many other methods make calls to it.
> This Jira switches read operations against the volume map to use the readLock 
> rather than the write lock.
> Additionally, some methods make a call to replicaMap.replicas() (eg 
> getBlockReports, getFinalizedBlocks, deepCopyReplica) and only use the result 
> in a read only fashion, so they can also be switched to using a readLock.
> Next is the directory scanner and disk balancer, which only require a read 
> lock.
> Finally (for this Jira) are various "low hanging fruit" items in BlockSender
> and FsDatasetImpl where it is fairly obvious they only need a read lock.
> For now, I have avoided changing anything which looks too risky, as I think
> it's better to do any larger refactoring or risky changes each in their own
> Jira.






[jira] [Updated] (HDFS-15818) Fix TestFsDatasetImpl.testReadLockCanBeDisabledByConfig

2021-09-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15818:
---
Fix Version/s: 3.2.4

> Fix TestFsDatasetImpl.testReadLockCanBeDisabledByConfig
> ---
>
> Key: HDFS-15818
> URL: https://issues.apache.org/jira/browse/HDFS-15818
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.2.4
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Current TestFsDatasetImpl.testReadLockCanBeDisabledByConfig is incorrect:
> 1) The test fails intermittently, as the holder thread can acquire the lock
> first
> [https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2666/1/testReport/]
>  
> 2) Test passes regardless of the setting of 
> DFS_DATANODE_LOCK_READ_WRITE_ENABLED_KEY






[jira] [Updated] (HDFS-15150) Introduce read write lock to Datanode

2021-09-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15150:
---
Fix Version/s: 3.2.4

> Introduce read write lock to Datanode
> -
>
> Key: HDFS-15150
> URL: https://issues.apache.org/jira/browse/HDFS-15150
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.3.0, 2.10.2, 3.2.3, 3.2.4
>
> Attachments: HDFS-15150-branch-2.10.001.patch, 
> HDFS-15150-branch-2.10.002.patch, HDFS-15150-branch-2.10.003.patch, 
> HDFS-15150.001.patch, HDFS-15150.002.patch, HDFS-15150.003.patch
>
>
> HDFS-9668 pointed out the issues around the DN lock being a point of 
> contention some time ago, but that Jira went in a direction of creating a new 
> FSDataset implementation which is very risky, and activity on the Jira has 
> stalled for a few years now. Edit: Looks like HDFS-9668 eventually went in a 
> similar direction to what I was thinking, so I will review that Jira in more 
> detail to see if this one is necessary.
> I feel there could be significant gains by moving to a ReentrantReadWrite 
> lock within the DN. The current implementation is simply a ReentrantLock so 
> any locker blocks all others.
> One place I think a read lock would benefit us significantly is when the DN
> is serving a lot of small blocks and there are jobs which perform a lot of
> reads. The start of reading any block right now takes the lock, but if we
> moved this to a read lock, many reads could do this at the same time.
> The first conservative step would be to change the current lock and then
> make all accesses to it obtain the write lock. That way, we should keep the
> current behaviour and then we can selectively move some lock accesses to the
> read lock in separate Jiras.
> I would appreciate any thoughts on this, and also if anyone has attempted it 
> before and found any blockers.






[jira] [Commented] (HDFS-15160) ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl methods should use datanode readlock

2021-09-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414794#comment-17414794
 ] 

Xiaoqiao He commented on HDFS-15160:


Thanks [~brahmareddy] for the information. I have cherry-picked this to both
branch-3.2.3 and branch-3.2.4 as a single commit rather than a squash.

> ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl 
> methods should use datanode readlock
> ---
>
> Key: HDFS-15160
> URL: https://issues.apache.org/jira/browse/HDFS-15160
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3, 3.2.4
>
> Attachments: HDFS-15160-branch-3.3-001.patch, HDFS-15160.001.patch, 
> HDFS-15160.002.patch, HDFS-15160.003.patch, HDFS-15160.004.patch, 
> HDFS-15160.005.patch, HDFS-15160.006.patch, HDFS-15160.007.patch, 
> HDFS-15160.008.patch, HDFS-15160.branch-3-3.001.patch, 
> image-2020-04-10-17-18-08-128.png, image-2020-04-10-17-18-55-938.png
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Now we have HDFS-15150, we can start to move some DN operations to use the
> read lock rather than the write lock to improve concurrency. The first step
> is to make the changes to ReplicaMap, as many other methods make calls to it.
> This Jira switches read operations against the volume map to use the readLock 
> rather than the write lock.
> Additionally, some methods make a call to replicaMap.replicas() (eg 
> getBlockReports, getFinalizedBlocks, deepCopyReplica) and only use the result 
> in a read only fashion, so they can also be switched to using a readLock.
> Next is the directory scanner and disk balancer, which only require a read 
> lock.
> Finally (for this Jira) are various "low hanging fruit" items in BlockSender
> and FsDatasetImpl where it is fairly obvious they only need a read lock.
> For now, I have avoided changing anything which looks too risky, as I think
> it's better to do any larger refactoring or risky changes each in their own
> Jira.






[jira] [Updated] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14997:
---
Attachment: HDFS-14997-branch-3.2.001.patch
Status: Patch Available  (was: Reopened)

Submit a patch for branch-3.2 and try to trigger Jenkins.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report (#sendHeartbeat, #blockReport,
> #cacheReport) and #processCommand, in the #BPServiceActor main process flow.
> If processCommand takes a long time it blocks the report flow, and
> processCommand can take a long time (over 1000s in the worst case I have met)
> when the IO load of the DataNode is very high. Since some IO operations run
> under #datasetLock, it can have to wait a long time to acquire #datasetLock
> when processing some commands (such as #DNA_INVALIDATE). In such a case, the
> #heartbeat will not be sent to the NameNode in time, which triggers other
> disasters.
> I propose to process #processCommand asynchronously so that it does not block
> #BPServiceActor from sending heartbeats back to the NameNode under high IO
> load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches do not
> support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should
> solve it in another JIRA.
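> A minimal sketch of the proposed shape (hypothetical names; not the exact
> committed patch):
> {code:java}
> // The heartbeat thread offers commands to a queue instead of processing them
> // inline, so a slow DNA_INVALIDATE holding #datasetLock can no longer delay
> // the next heartbeat.
> private final BlockingQueue<DatanodeCommand[]> commandQueue =
>     new LinkedBlockingQueue<>();
>
> void enqueue(DatanodeCommand[] cmds) {
>   commandQueue.offer(cmds); // returns immediately
> }
>
> // A dedicated command-processing thread drains the queue and does the
> // (possibly slow) work.
> void processQueue() throws InterruptedException {
>   while (shouldRun()) {
>     processCommand(commandQueue.take());
>   }
> }
> {code}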






[jira] [Reopened] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reopened HDFS-14997:


> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report (#sendHeartbeat, #blockReport,
> #cacheReport) and #processCommand, in the #BPServiceActor main process flow.
> If processCommand takes a long time it blocks the report flow, and
> processCommand can take a long time (over 1000s in the worst case I have met)
> when the IO load of the DataNode is very high. Since some IO operations run
> under #datasetLock, it can have to wait a long time to acquire #datasetLock
> when processing some commands (such as #DNA_INVALIDATE). In such a case, the
> #heartbeat will not be sent to the NameNode in time, which triggers other
> disasters.
> I propose to process #processCommand asynchronously so that it does not block
> #BPServiceActor from sending heartbeats back to the NameNode under high IO
> load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches do not
> support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should
> solve it in another JIRA.






[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416630#comment-17416630
 ] 

Xiaoqiao He commented on HDFS-14997:


The failed unit tests seem unrelated to this change. Will commit to branch-3.2
shortly.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report (#sendHeartbeat, #blockReport,
> #cacheReport) and #processCommand, in the #BPServiceActor main process flow.
> If processCommand takes a long time it blocks the report flow, and
> processCommand can take a long time (over 1000s in the worst case I have met)
> when the IO load of the DataNode is very high. Since some IO operations run
> under #datasetLock, it can have to wait a long time to acquire #datasetLock
> when processing some commands (such as #DNA_INVALIDATE). In such a case, the
> #heartbeat will not be sent to the NameNode in time, which triggers other
> disasters.
> I propose to process #processCommand asynchronously so that it does not block
> #BPServiceActor from sending heartbeats back to the NameNode under high IO
> load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches do not
> support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should
> solve it in another JIRA.






[jira] [Updated] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14997:
---
Fix Version/s: 3.2.4
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.2.4, 3.3.0
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report (#sendHeartbeat, #blockReport,
> #cacheReport) and #processCommand, in the #BPServiceActor main process flow.
> If processCommand takes a long time it blocks the report flow, and
> processCommand can take a long time (over 1000s in the worst case I have met)
> when the IO load of the DataNode is very high. Since some IO operations run
> under #datasetLock, it can have to wait a long time to acquire #datasetLock
> when processing some commands (such as #DNA_INVALIDATE). In such a case, the
> #heartbeat will not be sent to the NameNode in time, which triggers other
> disasters.
> I propose to process #processCommand asynchronously so that it does not block
> #BPServiceActor from sending heartbeats back to the NameNode under high IO
> load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches do not
> support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should
> solve it in another JIRA.






[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416722#comment-17416722
 ] 

Xiaoqiao He commented on HDFS-14997:


[~brahmareddy] This patch is ready now; it has few conflicts and I verified it
locally, and I have just checked it in to branch-3.2. If there is no other
objection, I will check the related patch in to branch-3.2.3 as well. What do
you think? Thanks.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.4
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, report (#sendHeartbeat, #blockReport,
> #cacheReport) and #processCommand, in the #BPServiceActor main process flow.
> If processCommand takes a long time it blocks the report flow, and
> processCommand can take a long time (over 1000s in the worst case I have met)
> when the IO load of the DataNode is very high. Since some IO operations run
> under #datasetLock, it can have to wait a long time to acquire #datasetLock
> when processing some commands (such as #DNA_INVALIDATE). In such a case, the
> #heartbeat will not be sent to the NameNode in time, which triggers other
> disasters.
> I propose to process #processCommand asynchronously so that it does not block
> #BPServiceActor from sending heartbeats back to the NameNode under high IO
> load.
> Notes:
> 1. Lifeline could be one effective solution, however some old branches do not
> support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should
> solve it in another JIRA.






[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2021-09-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Fix Version/s: 3.2.4

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.4
>
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch, HDFS-15075.008.patch, 
> HDFS-15075.009.patch
>
>
> HDFS-14997 moved the command processing to be asynchronous.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing inside the processing thread.
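> A hypothetical sketch of the move (Time.monotonicNow is the existing Hadoop
> utility; the threshold getter is an assumed name):
> {code:java}
> // Time the actual command handling inside the processing thread, where the
> // work really happens, instead of timing the enqueue call.
> long start = Time.monotonicNow();
> processCommand(cmds);
> long duration = Time.monotonicNow() - start;
> if (duration > dnConf.getProcessCommandsThresholdMs()) { // assumed config
>   LOG.warn("Took {} ms to process a command batch", duration);
> }
> {code}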






[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2021-09-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15651:
---
Fix Version/s: 3.2.4

> Client could not obtain block when DN CommandProcessingThread exit
> --
>
> Key: HDFS-15651
> URL: https://issues.apache.org/jira/browse/HDFS-15651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Aiphago
>Priority: Major
> Fix For: 3.3.1, 3.4.0, 3.2.4
>
> Attachments: HDFS-15651.001.patch, HDFS-15651.002.patch, 
> HDFS-15651.patch
>
>
> In our cluster, we applied the HDFS-14997 improvement.
>  We found one case where CommandProcessingThread exits due to an OOM error. 
> The OOM error was caused by an abnormal application running on this DN node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor 
> encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> The main point here is that a crashed CommandProcessingThread has a very bad 
> impact: none of the commands the NN sends back will be processed on the DN 
> side.
> We enabled block tokens for data access, but the DN command 
> DNA_ACCESSKEYUPDATE was not processed in time by the DN, and we then see 
> lots of Sasl errors due to key expiration in the DN log:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
> userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
> required block key (keyID=xxx) doesn't exist.]
> {noformat}
>  
> On the client side, our users receive lots of 'could not obtain block' 
> errors with BlockMissingException.
> CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>* CommandProcessingThread that process commands asynchronously.
>*/
>   class CommandProcessingThread extends Thread {
> private final BPServiceActor actor;
> private final BlockingQueue queue;
> ...
> @Override
> public void run() {
>   try {
> processQueue();
>   } catch (Throwable t) {
> LOG.error("{} encountered fatal exception and exit.", getName(), t);  
>  <=== should not exit this thread
>   }
> }
> {code}
> Once an unexpected error happens, better handling would be one of the 
> following (see the sketch after this list):
>  * catch the exception, deal with the error appropriately, and let 
> processQueue continue to run, or
>  * exit the DN process so that an admin can investigate.
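
A self-contained sketch of the first option, assuming a simple queue-draining 
thread; the names are illustrative, not the actual fix:

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch: keep the command thread alive when a single command fails. */
class ResilientCommandThread extends Thread {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private volatile boolean shouldRun = true;

  void enqueue(Runnable command) {
    queue.offer(command);
  }

  @Override
  public void run() {
    while (shouldRun) {
      try {
        queue.take().run();
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      } catch (Throwable t) {
        // One bad command (or a transient error in work it spawns) must
        // not stop all future NN commands such as DNA_ACCESSKEYUPDATE.
        System.err.println("Command failed, continuing: " + t);
      }
    }
  }
}
{code}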



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2021-09-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416725#comment-17416725
 ] 

Xiaoqiao He commented on HDFS-15651:


Cherry-pick to branch-3.2.

> Client could not obtain block when DN CommandProcessingThread exit
> --
>
> Key: HDFS-15651
> URL: https://issues.apache.org/jira/browse/HDFS-15651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Aiphago
>Priority: Major
> Fix For: 3.3.1, 3.4.0, 3.2.4
>
> Attachments: HDFS-15651.001.patch, HDFS-15651.002.patch, 
> HDFS-15651.patch
>
>
> In our cluster, we applied the HDFS-14997 improvement.
>  We found one case where CommandProcessingThread exits due to an OOM error. 
> The OOM error was caused by an abnormal application running on this DN node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor 
> encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> The main point here is that a crashed CommandProcessingThread has a very bad 
> impact: none of the commands the NN sends back will be processed on the DN 
> side.
> We enabled block tokens for data access, but the DN command 
> DNA_ACCESSKEYUPDATE was not processed in time by the DN, and we then see 
> lots of Sasl errors due to key expiration in the DN log:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
> userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
> required block key (keyID=xxx) doesn't exist.]
> {noformat}
>  
> On the client side, our users receive lots of 'could not obtain block' 
> errors with BlockMissingException.
> CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>* CommandProcessingThread that process commands asynchronously.
>*/
>   class CommandProcessingThread extends Thread {
> private final BPServiceActor actor;
> private final BlockingQueue queue;
> ...
> @Override
> public void run() {
>   try {
> processQueue();
>   } catch (Throwable t) {
> LOG.error("{} encountered fatal exception and exit.", getName(), t);  
>  <=== should not exit this thread
>   }
> }
> {code}
> Once an unexpected error happens, better handling would be one of the 
> following:
>  * catch the exception, deal with the error appropriately, and let 
> processQueue continue to run, or
>  * exit the DN process so that an admin can investigate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2021-09-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416724#comment-17416724
 ] 

Xiaoqiao He commented on HDFS-15075:


Cherry-pick to branch-3.2

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.4
>
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch, HDFS-15075.008.patch, 
> HDFS-15075.009.patch
>
>
> HDFS-14997 made command processing asynchronous.
> Right now, we time how long it takes to add a command to the queue.
> We should remove that metric and perhaps move the timing into the processing 
> thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2021-09-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416727#comment-17416727
 ] 

Xiaoqiao He commented on HDFS-15113:


Cherry-pick to branch-3.2.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Fix For: 3.3.0, 3.2.4
>
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch, 
> HDFS-15113.addendum.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after a 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs asynchronously, we cannot be sure which happens 
> first: sending the FBR or clearing the IBRs. If clearing the IBRs runs 
> first, it is OK. But if the FBR is sent first and the IBRs are cleared 
> afterwards, blocks received between those two points in time are missing 
> until the next FBR (see the toy model below).
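
A toy model of the ordering problem above; nothing here is DataNode code, 
just a minimal demonstration of why "clear IBRs, then snapshot" is the safe 
order:

{code:java}
import java.util.HashSet;
import java.util.Set;

/** Toy model of the FBR/IBR race; all names are illustrative. */
class IbrRaceSketch {
  final Set<Long> storedBlocks = new HashSet<>(); // blocks on disk
  final Set<Long> pendingIbrs  = new HashSet<>(); // incremental reports

  void blockReceived(long id) {
    storedBlocks.add(id);
    pendingIbrs.add(id);
  }

  /** BAD order: a block received between the two steps is never reported --
   *  it is absent from the FBR snapshot and then wiped from the IBR set. */
  Set<Long> fbrThenClear() {
    Set<Long> fbr = new HashSet<>(storedBlocks); // FBR snapshot taken
    // blockReceived(42L) arriving here would be lost until the next FBR
    pendingIbrs.clear();
    return fbr;
  }

  /** SAFE order: a block arriving after the clear stays in pendingIbrs and
   *  is reported by the next incremental block report. */
  Set<Long> clearThenFbr() {
    pendingIbrs.clear();
    return new HashSet<>(storedBlocks);
  }
}
{code}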



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2021-09-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15113:
---
Fix Version/s: 3.2.4

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Fix For: 3.3.0, 3.2.4
>
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch, 
> HDFS-15113.addendum.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after a 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs asynchronously, we cannot be sure which happens 
> first: sending the FBR or clearing the IBRs. If clearing the IBRs runs 
> first, it is OK. But if the FBR is sent first and the IBRs are cleared 
> afterwards, blocks received between those two points in time are missing 
> until the next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14997:
---
Fix Version/s: 3.2.3

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.3, 3.2.4
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions in the #BPServiceActor main process flow: 
> reporting (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. 
> If processCommand takes a long time, it blocks the report flow. Meanwhile, 
> processCommand can take a long time (over 1000s in the worst case I have 
> met) when the IO load of the DataNode is very high. Since some IO operations 
> run under #datasetLock, processing some commands (such as #DNA_INVALIDATE) 
> may wait a long time to acquire #datasetLock. In such cases, #heartbeat is 
> not sent to the NameNode in time, which triggers other disasters.
> I propose to run #processCommand asynchronously so that #BPServiceActor is 
> not blocked from sending heartbeats back to the NameNode under high IO load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do 
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should 
> solve that in another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418516#comment-17418516
 ] 

Xiaoqiao He commented on HDFS-14997:


Cherry-pick to branch-3.2 and branch-3.2.3.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.3, 3.2.4
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions in the #BPServiceActor main process flow: 
> reporting (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. 
> If processCommand takes a long time, it blocks the report flow. Meanwhile, 
> processCommand can take a long time (over 1000s in the worst case I have 
> met) when the IO load of the DataNode is very high. Since some IO operations 
> run under #datasetLock, processing some commands (such as #DNA_INVALIDATE) 
> may wait a long time to acquire #datasetLock. In such cases, #heartbeat is 
> not sent to the NameNode in time, which triggers other disasters.
> I propose to run #processCommand asynchronously so that #BPServiceActor is 
> not blocked from sending heartbeats back to the NameNode under high IO load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do 
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should 
> solve that in another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2021-09-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Fix Version/s: 3.2.3

cherry-pick to branch-3.2.3.

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.3, 3.2.4
>
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch, HDFS-15075.008.patch, 
> HDFS-15075.009.patch
>
>
> HDFS-14997 made command processing asynchronous.
> Right now, we time how long it takes to add a command to the queue.
> We should remove that metric and perhaps move the timing into the processing 
> thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit

2021-09-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15651:
---
Fix Version/s: 3.2.3

cherry-pick to branch-3.2.3.

> Client could not obtain block when DN CommandProcessingThread exit
> --
>
> Key: HDFS-15651
> URL: https://issues.apache.org/jira/browse/HDFS-15651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Aiphago
>Priority: Major
> Fix For: 3.3.1, 3.4.0, 3.2.3, 3.2.4
>
> Attachments: HDFS-15651.001.patch, HDFS-15651.002.patch, 
> HDFS-15651.patch
>
>
> In our cluster, we applied the HDFS-14997 improvement.
>  We found one case where CommandProcessingThread exits due to an OOM error. 
> The OOM error was caused by an abnormal application running on this DN node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor 
> encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> The main point here is that a crashed CommandProcessingThread has a very bad 
> impact: none of the commands the NN sends back will be processed on the DN 
> side.
> We enabled block tokens for data access, but the DN command 
> DNA_ACCESSKEYUPDATE was not processed in time by the DN, and we then see 
> lots of Sasl errors due to key expiration in the DN log:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
> userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the 
> required block key (keyID=xxx) doesn't exist.]
> {noformat}
>  
> On the client side, our users receive lots of 'could not obtain block' 
> errors with BlockMissingException.
> CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>* CommandProcessingThread that process commands asynchronously.
>*/
>   class CommandProcessingThread extends Thread {
> private final BPServiceActor actor;
> private final BlockingQueue queue;
> ...
> @Override
> public void run() {
>   try {
> processQueue();
>   } catch (Throwable t) {
> LOG.error("{} encountered fatal exception and exit.", getName(), t);  
>  <=== should not exit this thread
>   }
> }
> {code}
> Once an unexpected error happens, better handling would be one of the 
> following:
>  * catch the exception, deal with the error appropriately, and let 
> processQueue continue to run, or
>  * exit the DN process so that an admin can investigate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2021-09-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15113:
---
Fix Version/s: 3.2.3

cherry-pick to branch-3.2.3.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Fix For: 3.3.0, 3.2.3, 3.2.4
>
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch, 
> HDFS-15113.addendum.patch
>
>
> Recently, I meet one case that NameNode missing block after restart which is 
> related with HDFS-14997.
> a. during NameNode restart, it will return command `DNA_REGISTER` to DataNode 
> when receive some RPC request from DataNode.
> b. when DataNode receive `DNA_REGISTER` command, it will run #reRegister 
> async.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register will trigger BR immediately.
> d. because #reRegister run async, so we could not make sure which one run 
> first between send FBR and clear IBR. If clean IBR run first, it will be OK. 
> But if send FBR first then clear IBR, it will missing some blocks received 
> between these two time point until next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16235) Deadlock in LeaseRenewer for static remove method

2021-09-29 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422054#comment-17422054
 ] 

Xiaoqiao He commented on HDFS-16235:


Cherry-pick to branch-3.3 and branch-3.2 (including branch-3.2.3).

> Deadlock in LeaseRenewer for static remove method
> -
>
> Key: HDFS-16235
> URL: https://issues.apache.org/jira/browse/HDFS-16235
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.2, 3.2.4
>
> Attachments: HDFS-16235.001.patch, image-2021-09-23-19-31-57-337.png
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> !image-2021-09-23-19-31-57-337.png|width=3339,height=1936!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16235) Deadlock in LeaseRenewer for static remove method

2021-09-29 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16235:
---
Fix Version/s: 3.2.4
   3.3.2
   3.2.3

> Deadlock in LeaseRenewer for static remove method
> -
>
> Key: HDFS-16235
> URL: https://issues.apache.org/jira/browse/HDFS-16235
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.2, 3.2.4
>
> Attachments: HDFS-16235.001.patch, image-2021-09-23-19-31-57-337.png
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> !image-2021-09-23-19-31-57-337.png|width=3339,height=1936!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14575) LeaseRenewer#daemon threads leak in DFSClient

2021-09-29 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422055#comment-17422055
 ] 

Xiaoqiao He commented on HDFS-14575:


Thanks for the reminder. Cherry-pick to branch-3.3 and branch-3.2 (including 
branch-3.2.3). Thanks [~weichiu], [~prasad-acit].

> LeaseRenewer#daemon threads leak in DFSClient
> -
>
> Key: HDFS-14575
> URL: https://issues.apache.org/jira/browse/HDFS-14575
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tao Yang
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-14575.001.patch, HDFS-14575.002.patch, 
> HDFS-14575.003.patch, HDFS-14575.004.patch
>
>
> Currently, a LeaseRenewer (and its daemon thread) with no clients should be 
> terminated after a grace period, which defaults to 60 seconds. A race 
> condition may happen when a new request comes in just after the LeaseRenewer 
> has expired.
>  To reproduce this race condition:
>  # Client#1 creates File#1: this creates LeaseRenewer#1 and starts the 
> Daemon#1 thread. After a few seconds, File#1 is closed, so there are no 
> clients in LeaseRenewer#1 now.
>  # 60 seconds (the grace period) later, LeaseRenewer#1 expires, but the 
> Daemon#1 thread is still asleep. Client#1 creates File#2, leading to the 
> creation of Daemon#2.
>  # Daemon#1 wakes up and exits; after that, LeaseRenewer#1 is removed from 
> the factory.
>  # File#2 is closed after a few seconds, and LeaseRenewer#2 is created since 
> a renewer can no longer be obtained from the factory.
> The Daemon#2 thread leaks from this point on, since Client#1 inside it can 
> never be removed and it never gets a chance to stop.
> To solve this problem, IIUC, a simple way is to make sure that all clients 
> are cleared when a LeaseRenewer is removed from the factory (see the sketch 
> below). Please feel free to give your suggestions. Thanks!
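
A minimal sketch of that suggestion, assuming a simplified renewer and 
factory; these are illustrative names only, not the DFSClient classes:

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Sketch: clear clients when a renewer is removed from the factory. */
class LeaseRenewerSketch {
  private final List<Object> clients = new ArrayList<>();

  synchronized void addClient(Object c) { clients.add(c); }

  /** The daemon loop exits when it finds no clients left to renew. */
  synchronized boolean isEmpty() { return clients.isEmpty(); }

  /** Called by the factory on removal: with clients cleared, a stale
   *  daemon (Daemon#2 in the report) wakes up, sees an empty renewer,
   *  and terminates instead of leaking. */
  synchronized void clearOnRemoval() { clients.clear(); }
}
{code}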



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14575) LeaseRenewer#daemon threads leak in DFSClient

2021-09-29 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14575:
---
Fix Version/s: 3.2.4
   3.3.2
   3.2.3

> LeaseRenewer#daemon threads leak in DFSClient
> -
>
> Key: HDFS-14575
> URL: https://issues.apache.org/jira/browse/HDFS-14575
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tao Yang
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.4.0, 3.2.3, 3.3.2, 3.2.4
>
> Attachments: HDFS-14575.001.patch, HDFS-14575.002.patch, 
> HDFS-14575.003.patch, HDFS-14575.004.patch
>
>
> Currently, a LeaseRenewer (and its daemon thread) with no clients should be 
> terminated after a grace period, which defaults to 60 seconds. A race 
> condition may happen when a new request comes in just after the LeaseRenewer 
> has expired.
>  To reproduce this race condition:
>  # Client#1 creates File#1: this creates LeaseRenewer#1 and starts the 
> Daemon#1 thread. After a few seconds, File#1 is closed, so there are no 
> clients in LeaseRenewer#1 now.
>  # 60 seconds (the grace period) later, LeaseRenewer#1 expires, but the 
> Daemon#1 thread is still asleep. Client#1 creates File#2, leading to the 
> creation of Daemon#2.
>  # Daemon#1 wakes up and exits; after that, LeaseRenewer#1 is removed from 
> the factory.
>  # File#2 is closed after a few seconds, and LeaseRenewer#2 is created since 
> a renewer can no longer be obtained from the factory.
> The Daemon#2 thread leaks from this point on, since Client#1 inside it can 
> never be removed and it never gets a chance to stop.
> To solve this problem, IIUC, a simple way is to make sure that all clients 
> are cleared when a LeaseRenewer is removed from the factory. Please feel 
> free to give your suggestions. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16322) The NameNode implementation of ClientProtocol.truncate(...) can cause data loss.

2021-11-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443263#comment-17443263
 ] 

Xiaoqiao He commented on HDFS-16322:


Thanks [~Nsupyq] for your report. It is an interesting case. I am not sure 
whether it was reasonable to disable `idempotent` in HDFS-7926, but it can 
cause data loss when a client retries a request after a network or other 
issue. IMO, this could be fixed by enabling `idempotent` for truncate.

{quote}"idempotent" means applying the same operations multiple times will get 
the same result. If there is an append in the middle, the retry could get 
different results.

E.g. getPermission is idempotent. However, if there is a setPermission (or 
delete, rename, etc.) in the middle, the retry of getPermission could get a 
different result.{quote}
I just noticed that [~szetszwo] left this comment at HDFS-7926; I am not sure 
that explanation still holds. For example, Client A requests `create` with 
the overwrite option, it executes successfully on the NameNode side, but the 
response does not reach Client A, so Client A retries. Before the retry 
reaches the NameNode, another Client B deletes the file. The retry is then 
invoked and returns the previous result because of the retry cache. It is the 
same case as `truncate`.
cc [~shv], [~szetszwo] would you mind giving some suggestions? Thanks.

> The NameNode implementation of ClientProtocol.truncate(...) can cause data 
> loss.
> 
>
> Key: HDFS-16322
> URL: https://issues.apache.org/jira/browse/HDFS-16322
> Project: Hadoop HDFS
>  Issue Type: Bug
> Environment: The runtime environment is Ubuntu 18.04, Java 1.8.0_222 
> and Apache Maven 3.6.0. 
> The bug can be reproduced by the testMultipleTruncate() test in the 
> attachment. First, replace the file TestFileTruncate.java under the directory 
> "hadoop-3.3.1-src/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/"
>  with the attachment. Then run "mvn test 
> -Dtest=org.apache.hadoop.hdfs.server.namenode.TestFileTruncate#testMultipleTruncate"
>  to run the test case. Finally, the "assertFileLength(p, n+newLength)" at 
> line 199 of TestFileTruncate.java will abort, because the retry of 
> truncate() changes the file size and causes data loss.
>Reporter: nhaorand
>Priority: Major
> Attachments: TestFileTruncate.java
>
>
> The NameNode implementation of ClientProtocol.truncate(...) can cause data 
> loss. If the dfsclient drops the first response of a truncate RPC call, the 
> retry performed via the retry cache will truncate the file again and cause 
> data loss.
> HDFS-7926 avoids repeated execution of truncate(...) by checking whether the 
> file is already being truncated to the same length. However, under 
> concurrency, after the first execution of truncate(...), concurrent requests 
> from other clients may append new data and change the file length. When 
> truncate(...) is retried after that, it finds that the file has not been 
> truncated to the same length and truncates it again, which causes data loss 
> (see the toy retry-cache sketch below).
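
A toy model of how a retry cache prevents the double execution. The real 
NameNode uses org.apache.hadoop.ipc.RetryCache; everything below is a 
simplified stand-in:

{code:java}
import java.util.HashMap;
import java.util.Map;

/** Toy retry-cache guard for a non-idempotent operation. */
class TruncateRetryCacheSketch {
  // (clientId:callId) -> recorded outcome of the first execution
  private final Map<String, Boolean> cache = new HashMap<>();

  synchronized boolean truncate(String clientId, int callId,
                                Runnable doTruncate) {
    String key = clientId + ":" + callId;
    Boolean recorded = cache.get(key);
    if (recorded != null) {
      // A retry of the same RPC replays the recorded result and does
      // NOT truncate again, even if another client appended meanwhile.
      return recorded;
    }
    doTruncate.run();
    cache.put(key, Boolean.TRUE);
    return true;
  }
}
{code}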



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16320) Datanode retrieve slownode information from NameNode

2021-11-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443268#comment-17443268
 ] 

Xiaoqiao He commented on HDFS-16320:


Thanks [~Symious] for your report and patch. IMO it is a little tricky for 
the DataNode to get its slownode status from the NameNode. In theory, the 
DataNode has all the information needed to decide whether it is slow by 
itself, rather than following a NameNode command. Would you mind offering 
some more information about how you plan to use this status? Thanks.

> Datanode retrieve slownode information from NameNode
> 
>
> Key: HDFS-16320
> URL: https://issues.apache.org/jira/browse/HDFS-16320
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Janus Chow
>Assignee: Janus Chow
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current slownode information is reported by the reportingNode and stored 
> in the NameNode.
> This ticket is to let the slownode retrieve that information from the 
> NameNode, so that it can take further performance-improvement actions based 
> on it (see the sketch below).
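
A purely hypothetical sketch of the DN side of such a mechanism; none of 
these names exist in HDFS, and it only shows the shape of "receive a flag, 
cache it, let other components consult it":

{code:java}
/** Hypothetical DN-side cache of a slownode flag delivered by the NN. */
class SlownodeStatusSketch {
  private volatile boolean markedSlow;

  /** Would be called when an NN response (e.g. a heartbeat reply in
   *  this proposal) carries the slownode flag. */
  void onNamenodeResponse(boolean slowAccordingToNn) {
    markedSlow = slowAccordingToNn;
  }

  /** Other DN components could consult this, e.g. to throttle work or
   *  deprioritize this node during pipeline recovery. */
  boolean isMarkedSlow() {
    return markedSlow;
  }
}
{code}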



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16318) Add exception blockinfo

2021-11-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443273#comment-17443273
 ] 

Xiaoqiao He commented on HDFS-16318:


[~philipse] would you mind adding some information to the `Description` about 
what the issue is and how it is fixed? Thanks.

> Add exception blockinfo
> ---
>
> Key: HDFS-16318
> URL: https://issues.apache.org/jira/browse/HDFS-16318
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guo
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16320) Datanode retrieve slownode information from NameNode

2021-11-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443283#comment-17443283
 ] 

Xiaoqiao He commented on HDFS-16320:


Thanks [~Symious] for your quick response. That is a reasonable case. I mean 
that the DataNode has all the information needed to decide whether it is 
slow, based on response time or throughput, rather than relying on a command 
from the NameNode. Furthermore, false positives are possible on the NameNode 
side.
I am not against the idea, but we should find a more proper way to solve this 
problem. IMO, the communication between the client and the DataNode/pipeline 
could be used to estimate whether there are slow nodes and which one is slow. 
FYI. Thanks.

> Datanode retrieve slownode information from NameNode
> 
>
> Key: HDFS-16320
> URL: https://issues.apache.org/jira/browse/HDFS-16320
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Janus Chow
>Assignee: Janus Chow
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current slownode information is reported by the reportingNode and stored 
> in the NameNode.
> This ticket is to let the slownode retrieve that information from the 
> NameNode, so that it can take further performance-improvement actions based 
> on it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16322) The NameNode implementation of ClientProtocol.truncate(...) can cause data loss.

2021-11-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443550#comment-17443550
 ] 

Xiaoqiao He commented on HDFS-16322:


Thanks [~szetszwo] for your response. Consider the following case:
A. Client A requests truncate for file foo at time t0.
B. The NameNode executes the truncate request and responds to Client A at 
time t1.
C. Client B requests truncate and append for file foo at time t2; the file 
length does not change, but the content actually does.
D. Client A does not receive the response and retries (the end user cannot 
control this); because there is no RetryCache entry for truncate requests, 
the NameNode re-executes it and responds at t3.
After t3, the file content is not what was expected. (NOTE: t0 < t1 < t2 < t3)

> The NameNode implementation of ClientProtocol.truncate(...) can cause data 
> loss.
> 
>
> Key: HDFS-16322
> URL: https://issues.apache.org/jira/browse/HDFS-16322
> Project: Hadoop HDFS
>  Issue Type: Bug
> Environment: The runtime environment is Ubuntu 18.04, Java 1.8.0_222 
> and Apache Maven 3.6.0. 
> The bug can be reproduced by the testMultipleTruncate() test in the 
> attachment. First, replace the file TestFileTruncate.java under the directory 
> "hadoop-3.3.1-src/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/"
>  with the attachment. Then run "mvn test 
> -Dtest=org.apache.hadoop.hdfs.server.namenode.TestFileTruncate#testMultipleTruncate"
>  to run the test case. Finally, the "assertFileLength(p, n+newLength)" at 
> line 199 of TestFileTruncate.java will abort, because the retry of 
> truncate() changes the file size and causes data loss.
>Reporter: nhaorand
>Priority: Major
> Attachments: TestFileTruncate.java
>
>
> The NameNode implementation of ClientProtocol.truncate(...) can cause data 
> loss. If the dfsclient drops the first response of a truncate RPC call, the 
> retry performed via the retry cache will truncate the file again and cause 
> data loss.
> HDFS-7926 avoids repeated execution of truncate(...) by checking whether the 
> file is already being truncated to the same length. However, under 
> concurrency, after the first execution of truncate(...), concurrent requests 
> from other clients may append new data and change the file length. When 
> truncate(...) is retried after that, it finds that the file has not been 
> truncated to the same length and truncates it again, which causes data loss.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16322) The NameNode implementation of ClientProtocol.truncate(...) can cause data loss.

2021-11-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445627#comment-17445627
 ] 

Xiaoqiao He commented on HDFS-16322:


Great, enabling the RetryCache for truncate is a graceful solution for me, 
+1. Thanks [~shv] and [~szetszwo] for your comments.

> The NameNode implementation of ClientProtocol.truncate(...) can cause data 
> loss.
> 
>
> Key: HDFS-16322
> URL: https://issues.apache.org/jira/browse/HDFS-16322
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namanode
> Environment: The runtime environment is Ubuntu 18.04, Java 1.8.0_222 
> and Apache Maven 3.6.0. 
> The bug can be reproduced by the testMultipleTruncate() test in the 
> attachment. First, replace the file TestFileTruncate.java under the directory 
> "hadoop-3.3.1-src/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/"
>  with the attachment. Then run "mvn test 
> -Dtest=org.apache.hadoop.hdfs.server.namenode.TestFileTruncate#testMultipleTruncate"
>  to run the test case. Finally, the "assertFileLength(p, n+newLength)" at 
> line 199 of TestFileTruncate.java will abort, because the retry of 
> truncate() changes the file size and causes data loss.
>Reporter: nhaorand
>Priority: Major
> Attachments: TestFileTruncate.java, h16322_2026.patch
>
>
> The NameNode implementation of ClientProtocol.truncate(...) can cause data 
> loss. If the dfsclient drops the first response of a truncate RPC call, the 
> retry performed via the retry cache will truncate the file again and cause 
> data loss.
> HDFS-7926 avoids repeated execution of truncate(...) by checking whether the 
> file is already being truncated to the same length. However, under 
> concurrency, after the first execution of truncate(...), concurrent requests 
> from other clients may append new data and change the file length. When 
> truncate(...) is retried after that, it finds that the file has not been 
> truncated to the same length and truncates it again, which causes data loss.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15113:
---
Attachment: HDFS-15113.004.patch

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after a 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs asynchronously, we cannot be sure which happens 
> first: sending the FBR or clearing the IBRs. If clearing the IBRs runs 
> first, it is OK. But if the FBR is sent first and the IBRs are cleared 
> afterwards, blocks received between those two points in time are missing 
> until the next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051909#comment-17051909
 ] 

Xiaoqiao He commented on HDFS-15113:


Sorry for the late response; I was on a long vacation.
Submitted v004 with a unit test added to cover this case. Please take another 
look. cc [~elgoiri], [~weichiu], [~brahmareddy]. Thanks all.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after a 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs asynchronously, we cannot be sure which happens 
> first: sending the FBR or clearing the IBRs. If clearing the IBRs runs 
> first, it is OK. But if the FBR is sent first and the IBRs are cleared 
> afterwards, blocks received between those two points in time are missing 
> until the next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15205) FSImage sort section logic is wrong

2020-03-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051941#comment-17051941
 ] 

Xiaoqiao He commented on HDFS-15205:


[~angerszhuuu], thanks for your report. I tried to recall the test cases in 
my internal branch, and this looks the same as using the old logic (without 
this patch) to parse the new FSImage format. If so, it is expected behavior 
in my opinion: HDFS-14771 and HDFS-14617 both mark their release notes with 
an incompatibility warning. The root cause is shown in [this 
comment|https://issues.apache.org/jira/browse/HDFS-14771?focusedCommentId=16921585&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16921585].
If your case is not the same as the one mentioned above, it would be better 
to show how to reproduce it.
Thanks [~angerszhuuu] again.

> FSImage sort section logic is wrong
> ---
>
> Key: HDFS-15205
> URL: https://issues.apache.org/jira/browse/HDFS-15205
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: angerszhu
>Priority: Blocker
> Attachments: HDFS-15205.001.patch
>
>
> When loading an FSImage, it sorts the sections in FileSummary and loads the 
> sections in SectionName enum sequence. But the sort method is wrong: when I 
> use branch-2.6.0 to load an fsimage written by branch-2 with the patch 
> https://issues.apache.org/jira/browse/HDFS-14771, it throws an NPE because 
> it loads INODE first 
> {code:java}
> 2020-03-03 14:33:26,618 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadPermission(FSImageFormatPBINode.java:101)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:148)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadRootINode(FSImageFormatPBINode.java:332)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:218)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:254)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1036)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1020)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:741)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:677)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:290)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1092)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:780)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:609)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:666)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:838)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:817)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1538)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1606)
> {code}
> I printed the load order:
> {code:java}
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = INODE,  
> offset = 37, length = 11790829 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 37, length = 826591 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 826628, length = 828192 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 1654820, length = 835240 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 2490060, length = 833630 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 3323690, length = 909445 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 4233135, length = 866147 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB
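
To illustrate the intended ordering, here is a toy sketch; the enum below is 
a small illustrative subset of the real SectionName values:

{code:java}
import java.util.Arrays;

/** Sketch: FSImage sections must load in SectionName declaration order. */
class SectionOrderSketch {
  enum SectionName { NS_INFO, STRING_TABLE, INODE, INODE_SUB, INODE_DIR }

  public static void main(String[] args) {
    SectionName[] found =
        { SectionName.INODE, SectionName.STRING_TABLE, SectionName.NS_INFO };
    // Enums compare by declaration order (ordinal), so a plain sort yields
    // the intended load order. A sort keyed on something else (e.g. file
    // offset) can load INODE before STRING_TABLE and hit the reported NPE.
    Arrays.sort(found);
    System.out.println(Arrays.toString(found));
    // prints: [NS_INFO, STRING_TABLE, INODE]
  }
}
{code}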

[jira] [Comment Edited] (HDFS-15205) FSImage sort section logic is wrong

2020-03-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051941#comment-17051941
 ] 

Xiaoqiao He edited comment on HDFS-15205 at 3/5/20, 9:20 AM:
-

Thanks [~weichiu] for involving me here.
[~angerszhuuu], thanks for your report. I tried to recall the test cases in 
my internal branch, and this looks the same as using the old logic (without 
this patch) to parse the new FSImage format. If so, it is expected behavior 
in my opinion: HDFS-14771 and HDFS-14617 both mark their release notes with 
an incompatibility warning. The root cause is shown in [this 
comment|https://issues.apache.org/jira/browse/HDFS-14771?focusedCommentId=16921585&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16921585].
If your case is not the same as the one mentioned above, it would be better 
to show how to reproduce it.
Thanks [~angerszhuuu] again.


was (Author: hexiaoqiao):
[~angerszhuuu], thanks for your report. I tried to recall the test cases in 
my internal branch, and this looks the same as using the old logic (without 
this patch) to parse the new FSImage format. If so, it is expected behavior 
in my opinion: HDFS-14771 and HDFS-14617 both mark their release notes with 
an incompatibility warning. The root cause is shown in [this 
comment|https://issues.apache.org/jira/browse/HDFS-14771?focusedCommentId=16921585&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16921585].
If your case is not the same as the one mentioned above, it would be better 
to show how to reproduce it.
Thanks [~angerszhuuu] again.

> FSImage sort section logic is wrong
> ---
>
> Key: HDFS-15205
> URL: https://issues.apache.org/jira/browse/HDFS-15205
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: angerszhu
>Priority: Blocker
> Attachments: HDFS-15205.001.patch
>
>
> When loading an FSImage, it sorts the sections in FileSummary and loads the 
> sections in SectionName enum sequence. But the sort method is wrong: when I 
> use branch-2.6.0 to load an fsimage written by branch-2 with the patch 
> https://issues.apache.org/jira/browse/HDFS-14771, it throws an NPE because 
> it loads INODE first 
> {code:java}
> 2020-03-03 14:33:26,618 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadPermission(FSImageFormatPBINode.java:101)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:148)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadRootINode(FSImageFormatPBINode.java:332)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:218)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:254)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1036)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1020)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:741)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:677)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:290)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1092)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:780)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:609)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:666)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:838)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:817)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1538)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1606)
> {code}
> I printed the load order:
> {code:java}
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = INODE,  
> offset = 37, length = 11790829 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 37, length = 826591 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 826628, length = 8281

[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.

2020-03-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051952#comment-17051952
 ] 

Xiaoqiao He commented on HDFS-15180:


[~zhuqi] Thanks for your proposal and for involving me here.
It is a very valuable suggestion. Actually, [~sodonnell] and I have done some 
work here already: ref. HDFS-15150, which introduced a read/write lock, and 
HDFS-15160, which is in progress currently. Of course, HDFS-14997 (please take 
HDFS-15113 together if backporting) is another way to keep heavy IO from 
impacting the interaction with the NN.
Besides these works, I believe there are some other ways to split the global 
lock. My colleague [~Aiphag0] is trying to use {{BlockPoolLockManager}} to 
split {{dataLock}} more fine-grained. {{BlockPoolLockManager}} represents a 
pool with many rwlocks, which makes it more convenient for different 
BlockPools and different disks to acquire locks and improves parallel reads 
and writes. This work is nearly finished and has been gray-deployed in our 
production cluster. HDFS-15000 will track this work. 
Thanks [~zhuqi] again.
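
A rough sketch of the general idea, for illustration only (the actual 
{{BlockPoolLockManager}} implementation will be posted under HDFS-15000; the 
class and method names below are made up):
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// One read/write lock per block pool instead of a single global dataset
// lock, so operations on different block pools can proceed in parallel.
class BlockPoolLockPool {
  private final ConcurrentHashMap<String, ReadWriteLock> locks =
      new ConcurrentHashMap<>();

  ReadWriteLock get(String bpid) {
    // Lazily create the lock for a block pool on first use.
    return locks.computeIfAbsent(bpid, k -> new ReentrantReadWriteLock());
  }
}
{code}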

>  DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
> ---
>
> Key: HDFS-15180
> URL: https://issues.apache.org/jira/browse/HDFS-15180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> Now the FsDatasetImpl datasetLock is heavy when there are many namespaces in 
> a big cluster. We could split the FsDatasetImpl datasetLock by block pool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15113:
---
Attachment: HDFS-15113.005.patch

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister async.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs async, we cannot be sure which happens first: 
> sending the FBR or clearing the IBRs. If clearing the IBRs happens first, it 
> is OK. But if the FBR is sent first and the IBRs are cleared afterwards, the 
> blocks received between these two points in time will be missing until the 
> next FBR.
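
One way to close the race, sketched below for illustration (this is not 
necessarily the committed patch; it simply clears the IBRs before the 
registration that triggers the FBR):
{code:java}
// Sketch only: clear pending IBRs *before* register(), so the FBR triggered
// by registration is always at least as complete as what was cleared, and
// IBRs generated afterwards are preserved.
void reRegister() throws IOException {
  if (shouldRun()) {
    NamespaceInfo nsInfo = retrieveNamespaceInfo();
    if (state == HAServiceState.STANDBY || state == HAServiceState.OBSERVER) {
      ibrManager.clearIBRs();
    }
    register(nsInfo);            // triggers the full block report
    scheduler.scheduleHeartbeat();
  }
}
{code}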



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052320#comment-17052320
 ] 

Xiaoqiao He commented on HDFS-15113:


[~brahmareddy] Thanks for your quick response.
Submitted v005 and fixed the checkstyle warning reported by Jenkins. It seems 
the failed unit tests are related to this patch. Please help double-check. 
Thanks.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister async.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs async, we cannot be sure which happens first: 
> sending the FBR or clearing the IBRs. If clearing the IBRs happens first, it 
> is OK. But if the FBR is sent first and the IBRs are cleared afterwards, the 
> blocks received between these two points in time will be missing until the 
> next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052320#comment-17052320
 ] 

Xiaoqiao He edited comment on HDFS-15113 at 3/5/20, 4:40 PM:
-

[~brahmareddy] Thanks for your quick response.
Submitted v005 and fixed the checkstyle warning reported by Jenkins. It seems 
the failed unit tests are not related to this patch. Please help double-check. 
Thanks.


was (Author: hexiaoqiao):
[~brahmareddy] Thanks for your quick response.
Submitted v005 and fixed the checkstyle warning reported by Jenkins. It seems 
the failed unit tests are related to this patch. Please help double-check. 
Thanks.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister async.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs async, we cannot be sure which happens first: 
> sending the FBR or clearing the IBRs. If clearing the IBRs happens first, it 
> is OK. But if the FBR is sent first and the IBRs are cleared afterwards, the 
> blocks received between these two points in time will be missing until the 
> next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052719#comment-17052719
 ] 

Xiaoqiao He commented on HDFS-15113:


Thanks [~weichiu] for your reminder. I tried to run the other failed unit 
tests, TestReconstructStripedFileWithRandomECPolicy and TestDataNodeUUID, with 
v005 locally, and it seems both of them pass. Please help with another review. 
Thanks. 

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister async.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs async, we cannot be sure which happens first: 
> sending the FBR or clearing the IBRs. If clearing the IBRs happens first, it 
> is OK. But if the FBR is sent first and the IBRs are cleared afterwards, the 
> blocks received between these two points in time will be missing until the 
> next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15205) FSImage sort section logic is wrong

2020-03-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052734#comment-17052734
 ] 

Xiaoqiao He commented on HDFS-15205:


Thanks [~angerszhuuu] for your detailed comments. It looks like this sorting 
logic is wrong indeed. But I am not sure whether correcting it would solve the 
incompatibility issue completely. I will re-check compatibility in the next 
few days.
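
For reference, the comparator in question in FSImageFormatProtobuf looks 
roughly like this (paraphrased, so the exact shape may differ); note that both 
null branches return -1, so an unknown section name does not sort 
consistently:
{code:java}
Collections.sort(sections, new Comparator<FileSummary.Section>() {
  @Override
  public int compare(FileSummary.Section s1, FileSummary.Section s2) {
    SectionName n1 = SectionName.fromString(s1.getName());
    SectionName n2 = SectionName.fromString(s2.getName());
    if (n1 == null) {
      return n2 == null ? 0 : -1;
    } else if (n2 == null) {
      return -1; // suspicious: arguably this should be 1
    } else {
      return n1.ordinal() - n2.ordinal();
    }
  }
});
{code}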

> FSImage sort section logic is wrong
> ---
>
> Key: HDFS-15205
> URL: https://issues.apache.org/jira/browse/HDFS-15205
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: angerszhu
>Priority: Blocker
> Attachments: HDFS-15205.001.patch
>
>
> When loading an FSImage, it sorts the sections in FileSummary and loads the 
> sections in SectionName enum order. But the sort method is wrong: when I use 
> branch-2.6.0 to load an fsimage written by branch-2 with the patch 
> https://issues.apache.org/jira/browse/HDFS-14771, it throws an NPE because 
> it loads INODE first 
> {code:java}
> 2020-03-03 14:33:26,618 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadPermission(FSImageFormatPBINode.java:101)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:148)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadRootINode(FSImageFormatPBINode.java:332)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:218)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:254)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1036)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1020)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:741)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:677)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:290)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1092)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:780)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:609)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:666)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:838)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:817)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1538)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1606)
> {code}
> I printed the load order:
> {code:java}
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = INODE,  
> offset = 37, length = 11790829 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 37, length = 826591 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 826628, length = 828192 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 1654820, length = 835240 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 2490060, length = 833630 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 3323690, length = 909445 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 4233135, length = 866147 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 5099282, length = 1272751 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 6372033, length = 1311876 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 7683909, length = 1251510 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apach

[jira] [Commented] (HDFS-15205) FSImage sort section logic is wrong

2020-03-07 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053968#comment-17053968
 ] 

Xiaoqiao He commented on HDFS-15205:


Hi [~sodonnell], [~angerszhuuu], after re-checking and digging into the code, 
I believe this case is not related to the weird logic in the comparator 
mentioned above. Of course, I think it may be a BUG that we should fix, in my 
opinion. Actually, HDFS-14172 also mentioned this weird source segment, but 
did not discuss it deeply or fix it. I am not sure if there were some other 
considerations.
About this case (loading the FSImage fails when using the native logic 
(without the patch) to parse the new FSImage format), the root cause is as 
[~sodonnell] mentioned above:
a. INODE_SUB and INODE_DIR_SUB are both not defined in the SectionName enum in 
branch-2.6.
b. {{SectionName.fromString(n)}} returns null in the following switch/case 
condition at FSImageFormatProtobuf.Loader#loadInternal, which then hits the 
NPE.
{code:java}
switch (SectionName.fromString(n)) {
  ..
}
{code}
I tried both adding INODE_SUB and INODE_DIR_SUB to the SectionName enum and 
checking whether {{SectionName.fromString(n)}} is NULL before the switch/case 
statement, then tested again with my internal branch (based on branch-2.7); it 
looks like both variants load the new FSImage format normally.
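
A minimal sketch of the null-guard variant, assuming the section loop shape of 
loadInternal (the log message is illustrative):
{code:java}
// Skip sections this release does not know about instead of hitting an NPE
// when switching on a null enum value.
SectionName sectionName = SectionName.fromString(n);
if (sectionName == null) {
  LOG.warn("Unrecognized FSImage section " + n + ", skipping.");
  continue;
}
switch (sectionName) {
  // ... existing cases unchanged ...
}
{code}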

> FSImage sort section logic is wrong
> ---
>
> Key: HDFS-15205
> URL: https://issues.apache.org/jira/browse/HDFS-15205
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: angerszhu
>Priority: Blocker
> Attachments: HDFS-15205.001.patch
>
>
> When loading an FSImage, it sorts the sections in FileSummary and loads the 
> sections in SectionName enum order. But the sort method is wrong: when I use 
> branch-2.6.0 to load an fsimage written by branch-2 with the patch 
> https://issues.apache.org/jira/browse/HDFS-14771, it throws an NPE because 
> it loads INODE first 
> {code:java}
> 2020-03-03 14:33:26,618 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadPermission(FSImageFormatPBINode.java:101)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:148)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadRootINode(FSImageFormatPBINode.java:332)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:218)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:254)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1036)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1020)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:741)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:677)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:290)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1092)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:780)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:609)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:666)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:838)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:817)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1538)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1606)
> {code}
> I printed the load order:
> {code:java}
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = INODE,  
> offset = 37, length = 11790829 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 37, length = 826591 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 826628, length = 828192 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  offset = 1654820, length = 835240 ]
> 2020-03-03 15:49:36,424 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: [name = 
> INODE_SUB,  o

[jira] [Comment Edited] (HDFS-15205) FSImage sort section logic is wrong

2020-03-07 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053968#comment-17053968
 ] 

Xiaoqiao He edited comment on HDFS-15205 at 3/7/20, 10:21 AM:
--

Hi [~sodonnell], [~angerszhuuu], after re-checking and digging into the code, 
I believe this case is not related to the weird logic in the comparator 
mentioned above; I mean that correcting it alone does not fix the failure to 
load the new fsimage format. Of course, I think it may be a BUG that we should 
fix, in my opinion. Actually, HDFS-14172 also mentioned this weird source 
segment, but did not discuss it deeply or fix it. I am not sure if there were 
some other considerations.
About this case (loading the FSImage fails when using the native logic 
(without the patch) to parse the new FSImage format), the root cause is as 
[~sodonnell] mentioned above:
a. INODE_SUB and INODE_DIR_SUB are both not defined in the SectionName enum in 
branch-2.6.
b. {{SectionName.fromString(n)}} returns null in the following switch/case 
condition at FSImageFormatProtobuf.Loader#loadInternal, which then hits the 
NPE.
{code:java}
switch (SectionName.fromString(n)) {
  ..
}
{code}
I tried both adding INODE_SUB and INODE_DIR_SUB to the SectionName enum and 
checking whether {{SectionName.fromString(n)}} is NULL before the switch/case 
statement, then tested again with my internal branch (based on branch-2.7); it 
looks like both variants load the new FSImage format normally.


was (Author: hexiaoqiao):
Hi [~sodonnell], [~angerszhuuu], after re-checking and digging into the code, 
I believe this case is not related to the weird logic in the comparator 
mentioned above. Of course, I think it may be a BUG that we should fix, in my 
opinion. Actually, HDFS-14172 also mentioned this weird source segment, but 
did not discuss it deeply or fix it. I am not sure if there were some other 
considerations.
About this case (loading the FSImage fails when using the native logic 
(without the patch) to parse the new FSImage format), the root cause is as 
[~sodonnell] mentioned above:
a. INODE_SUB and INODE_DIR_SUB are both not defined in the SectionName enum in 
branch-2.6.
b. {{SectionName.fromString(n)}} returns null in the following switch/case 
condition at FSImageFormatProtobuf.Loader#loadInternal, which then hits the 
NPE.
{code:java}
switch (SectionName.fromString(n)) {
  ..
}
{code}
I tried both adding INODE_SUB and INODE_DIR_SUB to the SectionName enum and 
checking whether {{SectionName.fromString(n)}} is NULL before the switch/case 
statement, then tested again with my internal branch (based on branch-2.7); it 
looks like both variants load the new FSImage format normally.

> FSImage sort section logic is wrong
> ---
>
> Key: HDFS-15205
> URL: https://issues.apache.org/jira/browse/HDFS-15205
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: angerszhu
>Priority: Blocker
> Attachments: HDFS-15205.001.patch
>
>
> When loading an FSImage, it sorts the sections in FileSummary and loads the 
> sections in SectionName enum order. But the sort method is wrong: when I use 
> branch-2.6.0 to load an fsimage written by branch-2 with the patch 
> https://issues.apache.org/jira/browse/HDFS-14771, it throws an NPE because 
> it loads INODE first 
> {code:java}
> 2020-03-03 14:33:26,618 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadPermission(FSImageFormatPBINode.java:101)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:148)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadRootINode(FSImageFormatPBINode.java:332)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:218)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:254)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1036)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1020)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:741)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:677)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:290)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1092)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:780)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.l

[jira] [Commented] (HDFS-15207) VolumeScanner skip to scan blocks accessed during recent scan period

2020-03-09 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055576#comment-17055576
 ] 

Xiaoqiao He commented on HDFS-15207:


Thanks [~hadoop_yangyun] for your work. This improvement seems to depend on 
the access time attribute of the local file. Is there any case where this file 
is accessed by an external process other than the DataNode? There are also 
some random reads just to fetch the local file attribute. Is it possible to 
add a {{lastScanTime}} to ReplicaInfo to decide whether to scan or skip next 
time? Of course, it would occupy extra heap memory. Thanks again.
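
For illustration, a minimal sketch of the in-memory alternative (the names are 
hypothetical, not existing ReplicaInfo fields):
{code:java}
// Track the last scan time per replica in memory instead of relying on the
// local file's access time attribute.
class ReplicaScanState {
  private volatile long lastScanTimeMs; // costs ~8 bytes of heap per replica

  boolean shouldScan(long nowMs, long scanPeriodMs) {
    return nowMs - lastScanTimeMs >= scanPeriodMs;
  }

  void markScanned(long nowMs) {
    lastScanTimeMs = nowMs;
  }
}
{code}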

> VolumeScanner skip to scan blocks accessed during recent scan period
> 
>
> Key: HDFS-15207
> URL: https://issues.apache.org/jira/browse/HDFS-15207
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Attachments: HDFS-15207.002.patch, HDFS-15207.003.patch, 
> HDFS-15207.patch, HDFS-15207.patch
>
>
> Check the access time of the block file to avoid scanning recently accessed 
> blocks, reducing disk IO.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-09 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055579#comment-17055579
 ] 

Xiaoqiao He commented on HDFS-15113:


[~weichiu], [~elgoiri], [~brahmareddy] Hi guys, any further comments, or 
should we move forward here? Thanks.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister async.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs async, we cannot be sure which happens first: 
> sending the FBR or clearing the IBRs. If clearing the IBRs happens first, it 
> is OK. But if the FBR is sent first and the IBRs are cleared afterwards, the 
> blocks received between these two points in time will be missing until the 
> next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15160) ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl methods should use datanode readlock

2020-03-10 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055839#comment-17055839
 ] 

Xiaoqiao He commented on HDFS-15160:


Thanks [~sodonnell] for your work; it is a great improvement for the DataNode, 
based on the monitoring charts offered by [~zhuqi]. [^HDFS-15160.002.patch] 
almost LGTM, just one minor comment: DataNode#transferReplicaForPipelineRecovery 
also holds the FsDataset write lock currently, and in my opinion it could be 
changed to the read lock as well. Would you like to double-check?
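
For example, such a read-only path could look roughly like this (a sketch 
assuming the HDFS-15150 style dataset lock; the acquire method name is meant 
as the read-side counterpart introduced there):
{code:java}
// Read-only access to replica state takes the read lock, so readers no
// longer serialize against each other; only writers are exclusive.
try (AutoCloseableLock lock = dataset.acquireDatasetReadLock()) {
  // inspect the volume map / replica state without mutating it
}
{code}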

> ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl 
> methods should use datanode readlock
> ---
>
> Key: HDFS-15160
> URL: https://issues.apache.org/jira/browse/HDFS-15160
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Attachments: HDFS-15160.001.patch, HDFS-15160.002.patch
>
>
> Now we have HDFS-15150, we can start to move some DN operations to use the 
> read lock rather than the write lock to improve concurrency. The first step 
> is to make the changes to ReplicaMap, as many other methods make calls to it.
> This Jira switches read operations against the volume map to use the readLock 
> rather than the write lock.
> Additionally, some methods make a call to replicaMap.replicas() (eg 
> getBlockReports, getFinalizedBlocks, deepCopyReplica) and only use the result 
> in a read-only fashion, so they can also be switched to using a readLock.
> Next are the directory scanner and disk balancer, which only require a read 
> lock.
> Finally (for this Jira) there are various "low hanging fruit" items in 
> BlockSender and FsDatasetImpl where it is fairly obvious they only need a 
> read lock.
> For now, I have avoided changing anything which looks too risky, as I think 
> it's better to do any larger refactoring or risky changes each in their own 
> Jira.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-10 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056635#comment-17056635
 ] 

Xiaoqiao He commented on HDFS-15113:


Thanks [~brahmareddy],[~elgoiri] for your reviews.
[~weichiu] Would you have bandwidth to review or commit it? Thanks.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister async.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs async, we cannot be sure which happens first: 
> sending the FBR or clearing the IBRs. If clearing the IBRs happens first, it 
> is OK. But if the FBR is sent first and the IBRs are cleared afterwards, the 
> blocks received between these two points in time will be missing until the 
> next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.

2020-03-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-15180:
--

Assignee: Aiphago  (was: zhuqi)

>  DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
> ---
>
> Key: HDFS-15180
> URL: https://issues.apache.org/jira/browse/HDFS-15180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15180.001.patch, image-2020-03-10-17-22-57-391.png, 
> image-2020-03-10-17-31-58-830.png, image-2020-03-10-17-34-26-368.png
>
>
> Now the FsDatasetImpl datasetLock is heavy when there are many namespaces in 
> a big cluster. We could split the FsDatasetImpl datasetLock by block pool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.

2020-03-13 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059203#comment-17059203
 ] 

Xiaoqiao He commented on HDFS-15180:


Thanks [~Aiphag0] for your work and the POC patch.
Hi [~zhuqi], I just assigned this JIRA to [~Aiphag0]; please feel free to 
assign it back if you are interested in working on it together with [~Aiphag0].
cc [~sodonnell], [~zhuqi]: any suggestions are welcome here, and I look 
forward to your feedback and comments on the POC solution. Thanks.

>  DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
> ---
>
> Key: HDFS-15180
> URL: https://issues.apache.org/jira/browse/HDFS-15180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15180.001.patch, image-2020-03-10-17-22-57-391.png, 
> image-2020-03-10-17-31-58-830.png, image-2020-03-10-17-34-26-368.png
>
>
> Now the FsDatasetImpl datasetLock is heavy when there are many namespaces in 
> a big cluster. We could split the FsDatasetImpl datasetLock by block pool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15113:
---
Attachment: HDFS-15113.addendum.patch

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch, 
> HDFS-15113.addendum.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister async.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs async, we cannot be sure which happens first: 
> sending the FBR or clearing the IBRs. If clearing the IBRs happens first, it 
> is OK. But if the FBR is sent first and the IBRs are cleared afterwards, the 
> blocks received between these two points in time will be missing until the 
> next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-15 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059699#comment-17059699
 ] 

Xiaoqiao He commented on HDFS-15113:


Thanks [~weichiu] for your comments, and sorry for missing this JIRA. 
[^HDFS-15113.addendum.patch] tries to fix it following the review comments. 
PTAL.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch, 
> HDFS-15113.addendum.patch
>
>
> Recently, I met one case where the NameNode was missing blocks after 
> restart, which is related to HDFS-14997.
> a. During a NameNode restart, it returns the command `DNA_REGISTER` to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs 
> #reRegister async.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs async, we cannot be sure which happens first: 
> sending the FBR or clearing the IBRs. If clearing the IBRs happens first, it 
> is OK. But if the FBR is sent first and the IBRs are cleared afterwards, the 
> blocks received between these two points in time will be missing until the 
> next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.

2020-03-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15180:
---
Status: Patch Available  (was: Open)

>  DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
> ---
>
> Key: HDFS-15180
> URL: https://issues.apache.org/jira/browse/HDFS-15180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, 
> image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, 
> image-2020-03-10-17-34-26-368.png
>
>
> Now the FsDatasetImpl datasetLock is heavy when there are many namespaces in 
> a big cluster. We could split the FsDatasetImpl datasetLock by block pool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.

2020-03-20 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063403#comment-17063403
 ] 

Xiaoqiao He commented on HDFS-15180:


Try to trigger Jenkins.

>  DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
> ---
>
> Key: HDFS-15180
> URL: https://issues.apache.org/jira/browse/HDFS-15180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, 
> image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, 
> image-2020-03-10-17-34-26-368.png
>
>
> Now the FsDatasetImpl datasetLock is heavy when there are many namespaces in 
> a big cluster. We could split the FsDatasetImpl datasetLock by block pool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.

2020-03-21 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063805#comment-17063805
 ] 

Xiaoqiao He commented on HDFS-15180:


Not sure why the Jenkins report was not attached here. I just forward the 
result generated by Jenkins; it seems some unit tests failed, and they may be 
related to these changes. [~Aiphag0], please take another look. Also, some 
checkstyle warnings need to be fixed. 
 -1 overall
|Vote|Subsystem|Runtime|Comment|
|0|reexec|25m 55s|Docker mode activated.|
| | | |Prechecks|
|+1|@author|0m 0s|The patch does not contain any @author|
| | | |tags.|
|+1|test4tests|0m 0s|The patch appears to include 9 new or|
| | | |modified test files.|
| | | |trunk Compile Tests|
|+1|mvninstall|19m 47s|trunk passed|
|+1|compile|1m 0s|trunk passed|
|+1|checkstyle|0m 48s|trunk passed|
|+1|mvnsite|1m 6s|trunk passed|
|+1|shadedclient|16m 8s|branch has no errors when building and|
| | | |testing our client artifacts.|
|+1|findbugs|2m 47s|trunk passed|
|+1|javadoc|0m 39s|trunk passed|
| | | |Patch Compile Tests|
|+1|mvninstall|1m 2s|the patch passed|
|+1|compile|0m 56s|the patch passed|
|-1|javac|0m 56s|hadoop-hdfs-project_hadoop-hdfs|
| | | |generated 1 new + 585 unchanged - 0|
| | | |fixed = 586 total (was 585)|
|-0|checkstyle|0m 43s|hadoop-hdfs-project/hadoop-hdfs: The|
| | | |patch generated 33 new + 460 unchanged|
| | | | - 1 fixed = 493 total (was 461)|
|+1|mvnsite|1m 3s|the patch passed|
|+1|whitespace|0m 0s|The patch has no whitespace issues.|
|+1|shadedclient|14m 6s|patch has no errors when building and|
| | | |testing our client artifacts.|
|-1|findbugs|2m 57s|hadoop-hdfs-project/hadoop-hdfs|
| | | |generated 1 new + 0 unchanged - 0 fixed|
| | | |= 1 total (was 0)|
|+1|javadoc|0m 37s|the patch passed|
| | | |Other Tests|
|-1|unit|110m 35s|hadoop-hdfs in the patch passed.|
|+1|asflicense|0m 38s|The patch does not generate ASF|
| | | |License warnings.|
| | |200m 51s|
||Reason||Tests||
|FindBugs|module:hadoop-hdfs-project/hadoop-hdfs 
 Should org.apache.hadoop.hdfs.server.datanode.BlockPoolLockManager$TrackLog be 
a _static_ inner class? At BlockPoolLockManager.java:inner class? At 
BlockPoolLockManager.java:[lines 62-92]|
|Failed junit tests|hadoop.hdfs.server.datanode.TestBlockPoolLockManager|
| |hadoop.hdfs.server.namenode.ha.TestHAFsck|
| |hadoop.hdfs.server.datanode.TestBPOfferService|
| |hadoop.hdfs.TestDecommissionWithStriped|
| |hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery|
| |hadoop.hdfs.server.datanode.TestBlockRecovery|
||Subsystem||Report/Notes||
|Docker|Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:367833cf417|
|JIRA Issue|HDFS-15180|
|JIRA Patch URL|[^HDFS-15180.002.patch]|
|Optional Tests|dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle|
|uname|Linux 590078cb7ea7 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 
UTC 2019 x86_64 x86_64 x86_64 GNU/Linux|
|Build tool|maven|
|Personality|/testptch/patchprocess/precommit/personality/provided.sh|
|git revision|trunk / 36123170381|
|maven|version: Apache Maven 3.6.0|
|Default Java|1.8.0_242|
|findbugs|v3.1.0-RC1|
|javac|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/artifact/out/diff-compile-javac-hadoop-hdfs-project_hadoop-hdfs.txt]|
|checkstyle|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt]|
|findbugs|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/artifact/out/new-findbugs-hadoop-hdfs-project_hadoop-hdfs.html]|
|unit|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]|
|Test 
Results|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/testReport/]|
|Max. process+thread count|2998 (vs. ulimit of 5500)|
|modules|C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs|
|Console 
output|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/console]|
|Powered by|Apache Yetus 0.8.0 
[http://yetus.apache.org|http://yetus.apache.org/]|
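
The findbugs item is the standard "could be static" warning: a non-static 
inner class keeps an implicit reference to its enclosing instance. Assuming 
TrackLog does not touch BlockPoolLockManager state, the fix is a one-word 
change (the field below is made up for illustration):
{code:java}
// A static nested class carries no hidden reference to the enclosing
// BlockPoolLockManager, so instances are smaller and cannot leak it.
private static class TrackLog {
  private final long acquiredAtMs;

  TrackLog(long acquiredAtMs) {
    this.acquiredAtMs = acquiredAtMs;
  }
}
{code}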

>  DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
> ---
>
> Key: HDFS-15180
> URL: https://issues.apache.org/jira/browse/HDFS-15180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, 
> HDFS-15180.003.patch, image-2020-03-10-17-22-57-391.png, 
> image-2020-03-10-17-31-58-830.png, image-2020-03-10-17-34-26-368.png
>
>
> Now the FsDatasetImpl datasetLock is heavy, when their are many namespaces in 
> big cluster. If we can split the FsDatasetImpl datasetLock via blockpool. 



--
This message was sent by At

[jira] [Comment Edited] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.

2020-03-21 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063805#comment-17063805
 ] 

Xiaoqiao He edited comment on HDFS-15180 at 3/21/20, 7:57 AM:
--

Not sure why the Jenkins report was not attached here. I just forward the 
result generated by Jenkins; it seems some unit tests failed, and they may be 
related to these changes. [~Aiphag0], please take another look. Also, some 
checkstyle warnings need to be fixed. 
 -1 overall
|Vote|Subsystem|Runtime|Comment|
|0|reexec|25m 55s|Docker mode activated.|
| | | |Prechecks|
|+1|@author|0m 0s|The patch does not contain any @author|
| | | |tags.|
|+1|test4tests|0m 0s|The patch appears to include 9 new or|
| | | |modified test files.|
| | | |trunk Compile Tests|
|+1|mvninstall|19m 47s|trunk passed|
|+1|compile|1m 0s|trunk passed|
|+1|checkstyle|0m 48s|trunk passed|
|+1|mvnsite|1m 6s|trunk passed|
|+1|shadedclient|16m 8s|branch has no errors when building and|
| | | |testing our client artifacts.|
|+1|findbugs|2m 47s|trunk passed|
|+1|javadoc|0m 39s|trunk passed|
| | | |Patch Compile Tests|
|+1|mvninstall|1m 2s|the patch passed|
|+1|compile|0m 56s|the patch passed|
|-1|javac|0m 56s|hadoop-hdfs-project_hadoop-hdfs|
| | | |generated 1 new + 585 unchanged - 0|
| | | |fixed = 586 total (was 585)|
|-0|checkstyle|0m 43s|hadoop-hdfs-project/hadoop-hdfs: The|
| | | |patch generated 33 new + 460 unchanged|
| | | | - 1 fixed = 493 total (was 461)|
|+1|mvnsite|1m 3s|the patch passed|
|+1|whitespace|0m 0s|The patch has no whitespace issues.|
|+1|shadedclient|14m 6s|patch has no errors when building and|
| | | |testing our client artifacts.|
|-1|findbugs|2m 57s|hadoop-hdfs-project/hadoop-hdfs|
| | | |generated 1 new + 0 unchanged - 0 fixed|
| | | |= 1 total (was 0)|
|+1|javadoc|0m 37s|the patch passed|
| | | |Other Tests|
|-1|unit|110m 35s|hadoop-hdfs in the patch passed.|
|+1|asflicense|0m 38s|The patch does not generate ASF|
| | | |License warnings.|
| | |200m 51s| |

||Reason||Tests||
|FindBugs|module:hadoop-hdfs-project/hadoop-hdfs 
 Should org.apache.hadoop.hdfs.server.datanode.BlockPoolLockManager$TrackLog be 
a _static_ inner class? At BlockPoolLockManager.java:inner class? At 
BlockPoolLockManager.java:[lines 62-92]|
|Failed junit tests|hadoop.hdfs.server.datanode.TestBlockPoolLockManager|
| |hadoop.hdfs.server.namenode.ha.TestHAFsck|
| |hadoop.hdfs.server.datanode.TestBPOfferService|
| |hadoop.hdfs.TestDecommissionWithStriped|
| |hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery|
| |hadoop.hdfs.server.datanode.TestBlockRecovery|

||Subsystem||Report/Notes||
|Docker|Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:367833cf417|
|JIRA Issue|HDFS-15180|
|JIRA Patch URL|[^HDFS-15180.002.patch]|
|Optional Tests|dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle|
|uname|Linux 590078cb7ea7 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 
UTC 2019 x86_64 x86_64 x86_64 GNU/Linux|
|Build tool|maven|
|Personality|/testptch/patchprocess/precommit/personality/provided.sh|
|git revision|trunk / 36123170381|
|maven|version: Apache Maven 3.6.0|
|Default Java|1.8.0_242|
|findbugs|v3.1.0-RC1|
|javac|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/artifact/out/diff-compile-javac-hadoop-hdfs-project_hadoop-hdfs.txt]|
|checkstyle|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt]|
|findbugs|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/artifact/out/new-findbugs-hadoop-hdfs-project_hadoop-hdfs.html]|
|unit|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]|
|Test 
Results|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/testReport/]|
|Max. process+thread count|2998 (vs. ulimit of 5500)|
|modules|C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs|
|Console 
output|[https://builds.apache.org/job/PreCommit-HDFS-Build/28991/console]|
|Powered by|Apache Yetus 0.8.0 
[http://yetus.apache.org|http://yetus.apache.org/]|


was (Author: hexiaoqiao):
Not sure why the Jenkins report was not attached here. I just forward the 
result generated by Jenkins; it seems some unit tests failed, and they may be 
related to these changes. [~Aiphag0], please take another look. Also, some 
checkstyle warnings need to be fixed. 
 -1 overall
|Vote|Subsystem|Runtime|Comment|
|0|reexec|25m 55s|Docker mode activated.|
| | | |Prechecks|
|+1|@author|0m 0s|The patch does not contain any @author|
| | | |tags.|
|+1|test4tests|0m 0s|The patch appears to include 9 new or|
| | | |modified test files.|
| | | |trunk Compile Tests|
|+1|mvninstall|19m 47s|trunk passed|
|+1|compile|1m 0s|trunk passed|
|+1|checkstyle|0m 48s|trunk passed|
|+1|mvnsite|1m 6s|trunk passed|
|+1|shadedclient|16m 8s|branch has no errors when building and|
| | | |te

[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-21 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Attachment: HDFS-15075.005.patch

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.
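
For illustration, a minimal sketch of the idea, with hypothetical names 
({{processQueue}} and {{addProcessCommandTime}} are assumptions, not the 
committed code): the timing moves from the enqueue call into the async 
command-processing thread.

{code:java}
// Hedged sketch, not the committed patch: record processing time inside
// the async command-processing thread instead of timing the enqueue.
private void processQueue() {
  while (shouldRun()) {
    try {
      DatanodeCommand cmd = queue.take();  // blocking dequeue, not timed
      long startMs = Time.monotonicNow();
      processCommand(cmd);                 // the work we actually want to time
      dnMetrics.addProcessCommandTime(Time.monotonicNow() - startMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      break;
    }
  }
}
{code}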



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-21 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063838#comment-17063838
 ] 

Xiaoqiao He commented on HDFS-15075:


Considering HDFS-15113 has been pushed to trunk, should we continue with this 
improvement?
I have rebased and uploaded new patch v005. Please help review if you have 
bandwidth. Thanks. [~elgoiri]

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15169) RBF: Router FSCK should consider the mount table

2020-03-21 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15169:
---
Attachment: HDFS-15169.001.patch
  Assignee: Xiaoqiao He
Status: Patch Available  (was: Open)

Submitted patch v001 to trigger Jenkins.
[~aajisaka] I have assigned this JIRA to myself for the follow-up work. Please 
feel free to assign it back if you would like to work on this one.

> RBF: Router FSCK should consider the mount table
> 
>
> Key: HDFS-15169
> URL: https://issues.apache.org/jira/browse/HDFS-15169
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Akira Ajisaka
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15169.001.patch
>
>
> HDFS-13989 implemented FSCK for DFSRouter; however, for now it just redirects 
> the requests to all the active downstream NameNodes. The DFSRouter should 
> consider the mount table when redirecting the requests.
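
For illustration, a minimal sketch of mount-table-aware redirection 
({{getLocationsForPath}} and {{RemoteLocation}} are existing RBF types, but the 
wiring here is an assumption, not the patch itself):

{code:java}
// Hedged sketch: resolve the fsck path through the mount table and contact
// only the matching nameservices instead of broadcasting the request to
// every active downstream NameNode.
List<RemoteLocation> locations = rpcServer.getLocationsForPath(path, false);
Set<String> nameservices = new LinkedHashSet<>();
for (RemoteLocation location : locations) {
  nameservices.add(location.getNameserviceId());
}
// forward the fsck request only to the NameNodes of these nameservices
{code}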



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2020-03-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064211#comment-17064211
 ] 

Xiaoqiao He commented on HDFS-15051:


Thanks [~ayushtkn] for picking up this JIRA.
{quote}If the immediate parent doesn't exist, the parent above is checked for 
WRITE permission only, IMO it should be EXECUTE only. If the parent is there 
then we can check WRITE, else we can consider it to exist virtually and to have 
the required permissions, and move up normally.{quote}
This makes sense to me; I would like to update it in the next two days.

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch, 
> HDFS-15051.006.patch, HDFS-15051.007.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. 
> In some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission check.
> The following code segment tries to check permissions when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), and a 
> user could pass any mode, which could bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I just propose to restrict the WRITE MountTableEntry privilege to the super 
> user only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2020-03-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064210#comment-17064210
 ] 

Xiaoqiao He commented on HDFS-15051:


Thanks [~ayushtkn] for picking up this JIRA.
{quote}If the immediate parent doesn't exist, the parent above is checked for 
WRITE permission only, IMO it should be EXECUTE only. If the parent is there 
then we can check WRITE, else we can consider it to exist virtually and to have 
the required permissions, and move up normally.{quote}
This makes sense to me; I would like to update it in the next two days.

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch, 
> HDFS-15051.006.patch, HDFS-15051.007.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. 
> In some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission check.
> The following code segment tries to check permissions when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), and a 
> user could pass any mode, which could bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I just propose to restrict the WRITE MountTableEntry privilege to the super 
> user only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Attachment: HDFS-15075.006.patch

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2020-03-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15051:
---
Comment: was deleted

(was: Thanks [~ayushtkn] for picking up this JIRA.
{quote}If the immediate parent doesn't exist, the parent above is checked for 
WRITE permission only, IMO it should be EXECUTE only. If the parent is there 
then we can check WRITE, else we can consider it to exist virtually and to have 
the required permissions, and move up normally.{quote}
This makes sense to me; I would like to update it in the next two days.)

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch, 
> HDFS-15051.006.patch, HDFS-15051.007.patch, HDFS-15051.008.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. 
> In some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission check.
> The following code segment tries to check permissions when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), and a 
> user could pass any mode, which could bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I just propose to restrict the WRITE MountTableEntry privilege to the super 
> user only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064243#comment-17064243
 ] 

Xiaoqiao He commented on HDFS-15075:


Thanks [~elgoiri] for your comments, v006 try to update following above 
suggestions,
a. add unit test to verify part of new metrics (which can be checked directly) 
but both of them.
b. update metrics document.

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2020-03-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15051:
---
Attachment: HDFS-15051.008.patch

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch, 
> HDFS-15051.006.patch, HDFS-15051.007.patch, HDFS-15051.008.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. 
> In some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission check.
> The following code segment tries to check permissions when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), and a 
> user could pass any mode, which could bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I just propose to restrict the WRITE MountTableEntry privilege to the super 
> user only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2020-03-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064245#comment-17064245
 ] 

Xiaoqiao He commented on HDFS-15051:


v008 changes the permission check logic to require only EXECUTE (rather than 
WRITE) permission when the immediate parent doesn't exist while adding a mount 
point.
[~ayushtkn] Please take another look if you have bandwidth. Thanks.
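
For illustration, a minimal sketch of that check ({{getMountEntry}}, 
{{getParentPath}} and {{getNearestExistingAncestor}} are hypothetical helpers, 
not the v008 patch itself):

{code:java}
// Hedged sketch with hypothetical helpers: require WRITE on an existing
// immediate parent entry; if it does not exist, treat it as virtually
// existing and require only EXECUTE on the nearest existing ancestor.
private void checkAncestorPermission(String src) throws AccessControlException {
  MountTable parent = getMountEntry(getParentPath(src));    // hypothetical lookup
  if (parent != null) {
    checkPermission(parent, FsAction.WRITE);
  } else {
    MountTable ancestor = getNearestExistingAncestor(src);  // hypothetical
    if (ancestor != null) {
      checkPermission(ancestor, FsAction.EXECUTE);
    }
  }
}
{code}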

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch, 
> HDFS-15051.006.patch, HDFS-15051.007.patch, HDFS-15051.008.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. 
> In some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission check.
> The following code segment tries to check permissions when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), and a 
> user could pass any mode, which could bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I just propose to restrict the WRITE MountTableEntry privilege to the super 
> user only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12733) Option to disable to namenode local edits

2020-03-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064253#comment-17064253
 ] 

Xiaoqiao He commented on HDFS-12733:


Hi [~shv],[~brahmareddy],[~elgoiri],[~ayushtkn], sorry for leaving this issue 
pending for such a long time since we did not reach agreement last time, but 
the issue is still there and can sometimes impact performance. I would like to 
know if we could move forward without introducing a new configuration 
parameter, as v008 shows. Thanks everyone again.

> Option to disable to namenode local edits
> -
>
> Key: HDFS-12733
> URL: https://issues.apache.org/jira/browse/HDFS-12733
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, performance
>Reporter: Brahma Reddy Battula
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-12733-001.patch, HDFS-12733-002.patch, 
> HDFS-12733-003.patch, HDFS-12733.004.patch, HDFS-12733.005.patch, 
> HDFS-12733.006.patch, HDFS-12733.007.patch, HDFS-12733.008.patch
>
>
> As of now, edits are written to both local and shared locations, which is 
> redundant since local edits are never used in an HA setup.
> Disabling local edits gives a small performance improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-03-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064506#comment-17064506
 ] 

Xiaoqiao He commented on HDFS-15113:


Thanks [~weichiu]. Please refer to the Yetus result: 
https://builds.apache.org/job/PreCommit-HDFS-Build/29007/console

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch, HDFS-15113.004.patch, HDFS-15113.005.patch, 
> HDFS-15113.addendum.patch
>
>
> Recently, I met a case where the NameNode was missing blocks after restart, 
> which is related to HDFS-14997.
> a. During a NameNode restart, it will return the command `DNA_REGISTER` to a 
> DataNode when it receives some RPC request from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs #reRegister 
> asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register will trigger a BR immediately.
> d. Because #reRegister runs asynchronously, we cannot be sure which runs 
> first: sending the FBR or clearing the IBRs. If clearing the IBRs runs first, 
> it is OK. But if the FBR is sent first and the IBRs are cleared afterwards, 
> blocks received between these two points in time will be missing until the 
> next FBR.
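
For illustration, a minimal sketch of one race-free ordering (an assumption for 
clarity, not necessarily the committed fix): clear the pending IBRs before 
triggering the full block report, so every block is covered either by the FBR 
or by a later IBR.

{code:java}
// Hedged sketch: clearing IBRs before register() avoids the race because
// register() triggers a full block report, which already covers every
// block the cleared IBRs would have reported.
void reRegister() throws IOException {
  if (shouldRun()) {
    NamespaceInfo nsInfo = retrieveNamespaceInfo();
    if (state == HAServiceState.STANDBY || state == HAServiceState.OBSERVER) {
      ibrManager.clearIBRs();   // clear first ...
    }
    register(nsInfo);           // ... then trigger the full block report
    scheduler.scheduleHeartbeat();
  }
}
{code}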



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064541#comment-17064541
 ] 

Xiaoqiao He commented on HDFS-15075:


Thanks [~elgoiri] for your good catches; v007 updates the patch following your 
suggestions. Please give it another review. Thanks again.

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Attachment: HDFS-15075.007.patch

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15169) RBF: Router FSCK should consider the mount table

2020-03-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064569#comment-17064569
 ] 

Xiaoqiao He commented on HDFS-15169:


Attaching the Jenkins result link 
https://builds.apache.org/job/PreCommit-HDFS-Build/28995/console since Jenkins 
has been misbehaving for a few days.
Hi [~aajisaka],[~elgoiri],[~ayushtkn], would you like to have a review?

> RBF: Router FSCK should consider the mount table
> 
>
> Key: HDFS-15169
> URL: https://issues.apache.org/jira/browse/HDFS-15169
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Akira Ajisaka
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15169.001.patch
>
>
> HDFS-13989 implemented FSCK for DFSRouter; however, for now it just redirects 
> the requests to all the active downstream NameNodes. The DFSRouter should 
> consider the mount table when redirecting the requests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2020-03-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15082:
---
Attachment: HDFS-15082.003.patch

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch, HDFS-15082.002.patch, 
> HDFS-15082.003.patch
>
>
> When adding/updating a mount entry, the length of each component of the 
> destination path could exceed the filesystem path component length limit; see 
> `dfs.namenode.fs-limits.max-component-length` of the NameNode. So we should 
> check the length of each component of the destination path when 
> adding/updating a mount entry on the Router side.
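
For illustration, a minimal sketch of such a check (the helper name and its 
call site are assumptions, not the committed patch):

{code:java}
// Hedged sketch, hypothetical helper: reject a destination path if any
// component exceeds the NameNode-style component length limit.
private static void verifyMaxComponentLength(String dest, int maxComponentLength)
    throws IOException {
  if (maxComponentLength <= 0) {
    return;  // a limit of 0 disables the check, mirroring the NameNode
  }
  for (String component : dest.split(Path.SEPARATOR)) {
    if (component.length() > maxComponentLength) {
      throw new IOException("The component " + component + " of path " + dest
          + " exceeds the maximum component length " + maxComponentLength);
    }
  }
}
{code}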



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2020-03-24 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066381#comment-17066381
 ] 

Xiaoqiao He commented on HDFS-15082:


Thanks [~elgoiri] for picking up this issue. Submitted v003 (identical to v002) 
to trigger Jenkins.

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch, HDFS-15082.002.patch, 
> HDFS-15082.003.patch
>
>
> When adding/updating a mount entry, the length of each component of the 
> destination path could exceed the filesystem path component length limit; see 
> `dfs.namenode.fs-limits.max-component-length` of the NameNode. So we should 
> check the length of each component of the destination path when 
> adding/updating a mount entry on the Router side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Attachment: HDFS-15075.008.patch

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch, HDFS-15075.008.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066456#comment-17066456
 ] 

Xiaoqiao He commented on HDFS-15075:


Thanks [~weichiu],[~elgoiri] for your suggestions. v008 fixes only the metrics 
in {{BPServiceActor}}; the other changes will follow in the next JIRA. As for 
the other added metrics, I think they are different from the per-volume IO 
metrics: the existing metrics focus on external storage IO performance, while 
the added metrics focus on lock hold times. IIUC they are also necessary for 
performance work. FYI. Thanks again.

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch, HDFS-15075.008.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2020-03-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066464#comment-17066464
 ] 

Xiaoqiao He commented on HDFS-15082:


Hi [~elgoiri], I ran the failed unit test {{TestRouterFaultTolerant}} locally 
several times; it always passes and does not seem related to these changes. 
Please help double-check. Thanks.

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch, HDFS-15082.002.patch, 
> HDFS-15082.003.patch
>
>
> When adding/updating a mount entry, the length of each component of the 
> destination path could exceed the filesystem path component length limit; see 
> `dfs.namenode.fs-limits.max-component-length` of the NameNode. So we should 
> check the length of each component of the destination path when 
> adding/updating a mount entry on the Router side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15169) RBF: Router FSCK should consider the mount table

2020-03-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15169:
---
Attachment: HDFS-15169.002.patch

> RBF: Router FSCK should consider the mount table
> 
>
> Key: HDFS-15169
> URL: https://issues.apache.org/jira/browse/HDFS-15169
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Akira Ajisaka
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15169.001.patch, HDFS-15169.002.patch
>
>
> HDFS-13989 implemented FSCK for DFSRouter; however, for now it just redirects 
> the requests to all the active downstream NameNodes. The DFSRouter should 
> consider the mount table when redirecting the requests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15169) RBF: Router FSCK should consider the mount table

2020-03-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066635#comment-17066635
 ] 

Xiaoqiao He commented on HDFS-15169:


Thanks [~elgoiri] for your reviews. v002 adds a unit test for fsck with a 
non-mountpoint path request. Please check whether we need to cover any other 
cases. Thanks.

> RBF: Router FSCK should consider the mount table
> 
>
> Key: HDFS-15169
> URL: https://issues.apache.org/jira/browse/HDFS-15169
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Akira Ajisaka
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15169.001.patch, HDFS-15169.002.patch
>
>
> HDFS-13989 implemented FSCK for DFSRouter; however, for now it just redirects 
> the requests to all the active downstream NameNodes. The DFSRouter should 
> consider the mount table when redirecting the requests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Attachment: HDFS-15075.009.patch

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch, HDFS-15075.008.patch, 
> HDFS-15075.009.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066637#comment-17066637
 ] 

Xiaoqiao He commented on HDFS-15075:


Updated to v009 and fixed the findbugs warning.

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch, HDFS-15075.008.patch, 
> HDFS-15075.009.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-25 Thread Xiaoqiao He (Jira)
Xiaoqiao He created HDFS-15242:
--

 Summary: Add metrics for operations hold lock times of 
FsDatasetImpl
 Key: HDFS-15242
 URL: https://issues.apache.org/jira/browse/HDFS-15242
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Reporter: Xiaoqiao He
Assignee: Xiaoqiao He


Some operations of FsDatasetImpl need to hold the lock, and sometimes they take 
a long time to execute since they perform IO while holding the lock. I propose 
to add metrics for these operations so that it is more convenient to monitor 
them and dig into bottlenecks.
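
For illustration, a minimal sketch of the measurement ({{addLockHeldTime}} is a 
hypothetical metric name; the {{AutoCloseableLock}} usage mirrors the existing 
FsDatasetImpl pattern):

{code:java}
// Hedged sketch, hypothetical metric name: measure how long an operation
// holds the dataset lock and publish it, e.g. as a MutableRate metric.
try (AutoCloseableLock lock = datasetLock.acquire()) {
  long startMs = Time.monotonicNow();
  // ... operation that performs IO while holding the lock ...
  datanodeMetrics.addLockHeldTime(Time.monotonicNow() - startMs);
}
{code}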



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15242:
---
Attachment: HDFS-15242.001.patch
Status: Patch Available  (was: Open)

Submitted the initial patch v001 to trigger Jenkins.

> Add metrics for operations hold lock times of FsDatasetImpl
> ---
>
> Key: HDFS-15242
> URL: https://issues.apache.org/jira/browse/HDFS-15242
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15242.001.patch
>
>
> Some operations of FsDatasetImpl need to hold the lock, and sometimes they 
> take a long time to execute since they perform IO while holding the lock. I 
> propose to add metrics for these operations so that it is more convenient to 
> monitor them and dig into bottlenecks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2020-03-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1703#comment-1703
 ] 

Xiaoqiao He commented on HDFS-15075:


Hi [~weichiu],[~elgoiri], the metrics for {{FsDatasetImpl}} have been split 
from here into HDFS-15242. Please give another review if you have bandwidth. 
Thanks.

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch, HDFS-15075.005.patch, 
> HDFS-15075.006.patch, HDFS-15075.007.patch, HDFS-15075.008.patch, 
> HDFS-15075.009.patch
>
>
> HDFS-14997 moved the command processing to an async thread.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this and maybe move the timing into the thread itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15238) RBF:NamenodeHeartbeatService caused memory to grow rapidly

2020-03-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066869#comment-17066869
 ] 

Xiaoqiao He commented on HDFS-15238:


Thanks [~xuzq_zander] for your work. Good catch here.
+1 for [^HDFS-15238-002.patch] after fixing the typo 'Cachec' that [~elgoiri] 
mentioned above. Thanks.

> RBF:NamenodeHeartbeatService caused memory to grow rapidly
> --
>
> Key: HDFS-15238
> URL: https://issues.apache.org/jira/browse/HDFS-15238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-15238-002.patch, HDFS-15238-trunk-001.patch
>
>
> NamenodeHeartbeatService gets the NameNode's HA status every 5s and creates a 
> new HAServiceProtocol every time.
> When creating the HAServiceProtocol, it also creates a new Configuration.
> Over time, more and more REGISTER entries accumulate in Configuration until a 
> full GC happens.
> The entries then pile up again, and after reaching a certain threshold the 
> full GC is triggered again.
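
For illustration, a minimal sketch of the caching approach implied above (the 
field and wiring are assumptions, not the attached patch):

{code:java}
// Hedged sketch, hypothetical field: create the Configuration and the
// HAServiceProtocol proxy once and reuse them on every poll, so that
// Configuration REGISTER entries do not accumulate between full GCs.
private HAServiceProtocol cachedHaProtocol;

private HAServiceState getHAServiceState() throws IOException {
  if (cachedHaProtocol == null) {
    cachedHaProtocol = new HAServiceProtocolClientSideTranslatorPB(
        serviceAddress, new Configuration());  // built once, then cached
  }
  return cachedHaProtocol.getServiceStatus().getState();
}
{code}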



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15242:
---
Attachment: HDFS-15242.002.patch

> Add metrics for operations hold lock times of FsDatasetImpl
> ---
>
> Key: HDFS-15242
> URL: https://issues.apache.org/jira/browse/HDFS-15242
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch
>
>
> Some operations of FsDatasetImpl need to hold the lock, and sometimes they 
> take a long time to execute since they perform IO while holding the lock. I 
> propose to add metrics for these operations so that it is more convenient to 
> monitor them and dig into bottlenecks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067344#comment-17067344
 ] 

Xiaoqiao He commented on HDFS-15242:


Thanks [~weichiu],[~elgoiri] for your reviews. Rebased and uploaded patch v002. 
v002 improves the recording for #createTemporary, which holds the write lock 
twice; I try to record them together. Please give it another review. Thanks.

> Add metrics for operations hold lock times of FsDatasetImpl
> ---
>
> Key: HDFS-15242
> URL: https://issues.apache.org/jira/browse/HDFS-15242
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch
>
>
> Some operations of FsDatasetImpl need to hold the lock, and sometimes they 
> take a long time to execute since they perform IO while holding the lock. I 
> propose to add metrics for these operations so that it is more convenient to 
> monitor them and dig into bottlenecks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12733) Option to disable to namenode local edits

2020-03-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067442#comment-17067442
 ] 

Xiaoqiao He commented on HDFS-12733:


Thanks [~elgoiri],[~ayushtkn] for your great feedback. v008 does rely on 
setting `dfs.namenode.edits.dir` to blank to disable local edits. Yes, we need 
more information to make this change clear. It is true that their actions are 
different; IMO it is feasible to unify them so that we disable local edits if 
the config is blank and tell our end users explicitly. Of course, patch v008 is 
not a complete story yet. I would like to update the patch if there is no 
objection. Thanks again [~elgoiri],[~ayushtkn] for your suggestions.
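
For illustration, a minimal sketch of the proposed behavior (an assumption for 
clarity, not the committed change):

{code:java}
// Hedged sketch: treat a blank dfs.namenode.edits.dir as "no local edits".
Configuration conf = new Configuration();
conf.set(DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_KEY, "");  // blank => disabled
// FSNamesystem#getNamespaceEditsDirs would then return only the shared
// (e.g. QJM) edits directories in an HA setup.
{code}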

> Option to disable to namenode local edits
> -
>
> Key: HDFS-12733
> URL: https://issues.apache.org/jira/browse/HDFS-12733
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, performance
>Reporter: Brahma Reddy Battula
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-12733-001.patch, HDFS-12733-002.patch, 
> HDFS-12733-003.patch, HDFS-12733.004.patch, HDFS-12733.005.patch, 
> HDFS-12733.006.patch, HDFS-12733.007.patch, HDFS-12733.008.patch
>
>
> As of now, edits are written to both local and shared locations, which is 
> redundant since local edits are never used in an HA setup.
> Disabling local edits gives a small performance improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15242) Add metrics for operations hold lock times of FsDatasetImpl

2020-03-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067915#comment-17067915
 ] 

Xiaoqiao He commented on HDFS-15242:


Thanks [~elgoiri] for your reviews. v003 fixes the typo. PTAL.

> Add metrics for operations hold lock times of FsDatasetImpl
> ---
>
> Key: HDFS-15242
> URL: https://issues.apache.org/jira/browse/HDFS-15242
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15242.001.patch, HDFS-15242.002.patch, 
> HDFS-15242.003.patch
>
>
> Some operations of FsDatasetImpl need to hold the lock, and sometimes they 
> take a long time to execute since they perform IO while holding the lock. I 
> propose to add metrics for these operations so that it is more convenient to 
> monitor them and dig into bottlenecks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


