[ https://issues.apache.org/jira/browse/HDFS-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783910#comment-17783910 ]
ASF GitHub Bot commented on HDFS-17241:
---------------------------------------

shuaiqig commented on PR #6236:
URL: https://github.com/apache/hadoop/pull/6236#issuecomment-1801256224

> > it uploads a fsimage, which may be tens of gigabytes, which will make the disk where the ANN metadata is stored very busy
>
> This makes sense to me. I don't know whether the case @shuaiqig reported is the same as the one mentioned here. If it is, I don't think it can be prevented just by avoiding running the checkpoint and the edit-log roll at the same time from the Standby, because the Active can sometimes trigger an edit-log roll by itself. On the other hand, if the edit log is not rolled in time, replaying it will put more load on the Standby, because many more transactions will pile up to be processed. Right? IMO, we should improve the local storage performance or limit the throughput of the fsimage upload, e.g. by decreasing `dfs.image.transfer.chunksize`. FYI.

@Hexiaoqiao Yes, you are right. As I mentioned in the jira, this PR is not the best way to solve this problem; it is just the fastest and most convenient way I could think of for now. Thank you for your review.

> long write lock on active NN from rollEditLog()
> -----------------------------------------------
>
>                 Key: HDFS-17241
>                 URL: https://issues.apache.org/jira/browse/HDFS-17241
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.1.2
>            Reporter: shuaiqi.guo
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HDFS-17241.patch
>
> When the standby NN triggers a log roll on the active NN while it is also sending an fsimage to the active NN, the active NN holds a long write lock, which blocks almost all requests.
> For example:
> {code:java}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 27179 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:273)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:235)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1617)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4663)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1292)
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:146)
> org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12974)
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> java.security.AccessController.doPrivileged(Native Method)
> javax.security.auth.Subject.doAs(Subject.java:422)
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
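Editor's note: the mitigation discussed above (throttling the fsimage upload rather than changing the NameNode locking) can be sketched as an hdfs-site.xml fragment. The property names below are standard HDFS settings, but the values are illustrative assumptions only, not tested recommendations for this cluster:

```xml
<!-- hdfs-site.xml: sketch of fsimage-transfer throttling; values are illustrative -->
<property>
  <!-- Cap the fsimage transfer rate in bytes/sec; 0 (the default) means unthrottled.
       Example value: 50 MB/s, an assumed figure for illustration. -->
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>52428800</value>
</property>
<property>
  <!-- Smaller chunk size per the suggestion in the comment above;
       the default is 65536 (64 KB). -->
  <name>dfs.image.transfer.chunksize</name>
  <value>32768</value>
</property>
```

Throttling the upload trades a longer checkpoint-transfer window for less disk contention on the active NN's metadata volume, which is what lets `rollEditLog()` hold the write lock for so long in the reported case.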