[
https://issues.apache.org/jira/browse/HDFS-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781991#comment-17781991
]
ASF GitHub Bot commented on HDFS-17241:
---------------------------------------
Hexiaoqiao commented on PR #6236:
URL: https://github.com/apache/hadoop/pull/6236#issuecomment-1790057643
> it uploads a fsimage, which may be tens of gigabytes, making the disk
> where the ANN metadata is stored very busy
This makes sense to me. I don't know whether the case @shuaiqig reported is
the same as the one mentioned here. If it is, I don't think it can be prevented
merely by avoiding running a checkpoint and rolling the edit log at the same
time from the Standby, because the Active can sometimes trigger an edit log
roll by itself. On the other hand, if the edit log is not rolled in time, the
Standby will bear more load when replaying it, because many more transactions
will accumulate to be processed. Right?
IMO, we should improve the local storage performance or limit the throughput
when uploading the fsimage, for example by decreasing
`dfs.image.transfer.chunksize`. FYI.
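The throttling knobs mentioned above live in hdfs-site.xml. A minimal sketch,
assuming `dfs.image.transfer.chunksize` and `dfs.image.transfer.bandwidthPerSec`
as documented in hdfs-default.xml; the values below are illustrative, not
recommendations:

```xml
<!-- hdfs-site.xml: illustrative values, tune for your cluster. -->
<property>
  <!-- Size of each chunk streamed during fsimage transfer; a smaller
       chunk reduces burst I/O on the ANN metadata disk (default 65536). -->
  <name>dfs.image.transfer.chunksize</name>
  <value>32768</value>
</property>
<property>
  <!-- Cap fsimage transfer throughput in bytes per second;
       0 (the default) means unthrottled. Here: 50 MB/s. -->
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>52428800</value>
</property>
```

Throttling lengthens the upload but smooths the disk load, so the edit log
roll and the fsimage upload compete less for the same spindle.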
> long write lock on active NN from rollEditLog()
> -----------------------------------------------
>
> Key: HDFS-17241
> URL: https://issues.apache.org/jira/browse/HDFS-17241
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.1.2
> Reporter: shuaiqi.guo
> Priority: Major
> Labels: pull-request-available
> Attachments: HDFS-17241.patch
>
>
> when the standby NN triggers a log roll on the active NN while also sending
> a fsimage to the active NN, the active NN will hold a long write lock,
> blocking almost all requests, e.g.:
> {code:java}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write
> lock held for 27179 ms via java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:273)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:235)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1617)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4663)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1292)
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:146)
> org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12974)
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> java.security.AccessController.doPrivileged(Native Method)
> javax.security.auth.Subject.doAs(Subject.java:422)
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> {code}
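The "write lock held for 27179 ms" warning in the trace above is emitted when
a lock hold exceeds the reporting threshold. A minimal hdfs-site.xml sketch,
assuming the `dfs.namenode.write-lock-reporting-threshold-ms` property from
hdfs-default.xml is available in the running release; lowering it surfaces
shorter lock holds earlier for diagnosis:

```xml
<property>
  <!-- Log a stack trace whenever the FSNamesystem write lock is held
       longer than this many milliseconds (illustrative value). -->
  <name>dfs.namenode.write-lock-reporting-threshold-ms</name>
  <value>1000</value>
</property>
```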