[
https://issues.apache.org/jira/browse/HDFS-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781991#comment-17781991
]
ASF GitHub Bot commented on HDFS-17241:
---------------------------------------
Hexiaoqiao commented on PR #6236:
URL: https://github.com/apache/hadoop/pull/6236#issuecomment-1790057643
> it uploads a fsimage, which may be tens of gigabytes, making the disk
> where the ANN metadata is stored very busy
This makes sense to me. I don't know whether the case @shuaiqig reported is
the same as the one mentioned here. If it is, I don't think it can be prevented
merely by avoiding running a checkpoint and rolling the edit log at the same
time from the Standby, because the Active can sometimes trigger an edit log
roll by itself. On the other hand, if the edit log is not rolled in time, the
Standby will bear more load when replaying it, because many more transactions
will accumulate to be processed. Right?
IMO, we should improve the local storage performance or limit the throughput
when uploading the fsimage, for example by decreasing
`dfs.image.transfer.chunksize`. FYI.
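The throttling knobs mentioned above live in hdfs-site.xml. A minimal sketch,
assuming `dfs.image.transfer.chunksize` and `dfs.image.transfer.bandwidthPerSec`
as documented in hdfs-default.xml; the values below are illustrative, not
recommendations:

```xml
<!-- hdfs-site.xml: illustrative values, tune for your cluster. -->
<property>
  <!-- Size of each chunk streamed during fsimage transfer; a smaller
       chunk reduces burst I/O on the ANN metadata disk (default 65536). -->
  <name>dfs.image.transfer.chunksize</name>
  <value>32768</value>
</property>
<property>
  <!-- Cap fsimage transfer throughput in bytes per second;
       0 (the default) means unthrottled. Here: 50 MB/s. -->
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>52428800</value>
</property>
```

Throttling lengthens the upload but smooths the disk load, so the edit log
roll and the fsimage upload compete less for the same spindle.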
> long write lock on active NN from rollEditLog()
> -----------------------------------------------
>
> Key: HDFS-17241
> URL: https://issues.apache.org/jira/browse/HDFS-17241
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.1.2
> Reporter: shuaiqi.guo
> Priority: Major
> Labels: pull-request-available
> Attachments: HDFS-17241.patch
>
>
> when the standby NN triggers a log roll on the active NN while also sending
> a fsimage to the active NN, the active NN will hold a long write lock,
> blocking almost all requests, e.g.:
> {code:java}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write
> lock held for 27179 ms via java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:273)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:235)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1617)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4663)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1292)
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:146)
> org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12974)
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> java.security.AccessController.doPrivileged(Native Method)
> javax.security.auth.Subject.doAs(Subject.java:422)
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> {code}
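The "write lock held for 27179 ms" warning in the trace above is emitted when
a lock hold exceeds the reporting threshold. A minimal hdfs-site.xml sketch,
assuming the `dfs.namenode.write-lock-reporting-threshold-ms` property from
hdfs-default.xml is available in the running release; lowering it surfaces
shorter lock holds earlier for diagnosis:

```xml
<property>
  <!-- Log a stack trace whenever the FSNamesystem write lock is held
       longer than this many milliseconds (illustrative value). -->
  <name>dfs.namenode.write-lock-reporting-threshold-ms</name>
  <value>1000</value>
</property>
```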