[ https://issues.apache.org/jira/browse/HDFS-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791877#comment-17791877 ]
ASF GitHub Bot commented on HDFS-17241:
---------------------------------------

shuaiqig commented on PR #6236:
URL: https://github.com/apache/hadoop/pull/6236#issuecomment-1835368742

> > When rollEditLog() is called, the ANN writes seen_txid in both dfs.namenode.name.dir and dfs.namenode.edits.dir (regardless of whether they are isolated or not) while holding the write lock. If I/O utilization is high, writing the small seen_txid file takes a long time, which indirectly causes the ANN to hold the write lock for a long time.
>
> Back to this PR. For an HA-mode cluster, if we set the same storage device for both `dfs.namenode.name.dir` and `dfs.namenode.edits.dir`, it could lead to high load on that storage, especially for a large cluster, and impact the performance of the ANN. [HDFS-12733](https://issues.apache.org/jira/browse/HDFS-12733), proposed years ago, tries to disable local edits for HA mode with shared edit dirs. (NOTE: this is a draft patch that was never checked in to trunk; it cannot be checked in smoothly now and needs careful review if used as a reference.) Hope it could solve this issue. Thanks.

I have set different paths for `dfs.namenode.name.dir` and `dfs.namenode.edits.dir`, but they are still on the same storage device. I will try to set different storage devices for them later. Thanks for your help.
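For illustration, here is a minimal hdfs-site.xml sketch of the separation discussed above, keeping the fsimage directory and the edits directory on different storage devices. The mount points `/mnt/disk1` and `/mnt/disk2` are hypothetical placeholders, not paths from this issue; substitute directories that actually reside on separate physical devices:

{code:xml}
<!-- Hypothetical sketch: place the fsimage directory and the edits
     directory on different storage devices, so the seen_txid and
     edit-log writes made under the FSNamesystem write lock do not
     queue behind fsimage I/O on the same disk. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <!-- fsimage (and a seen_txid copy); /mnt/disk1 is a placeholder -->
  <value>/mnt/disk1/hadoop/hdfs/name</value>
</property>
<property>
  <name>dfs.namenode.edits.dir</name>
  <!-- edit logs (and a seen_txid copy); /mnt/disk2 is a placeholder -->
  <value>/mnt/disk2/hadoop/hdfs/edits</value>
</property>
{code}

As the comment notes, merely using different paths is not enough: the two directories must map to different physical devices for the I/O load to actually be split.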
> long write lock on active NN from rollEditLog()
> -----------------------------------------------
>
>                 Key: HDFS-17241
>                 URL: https://issues.apache.org/jira/browse/HDFS-17241
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.1.2
>            Reporter: shuaiqi.guo
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HDFS-17241.patch
>
> When the standby NN triggers a log roll on the active NN and sends the fsimage to the active NN at the same time, the active NN holds a long write lock, which blocks almost all requests, like:
> {code:java}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 27179 ms via java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:273)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:235)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1617)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4663)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1292)
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:146)
> org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12974)
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> java.security.AccessController.doPrivileged(Native Method)
> javax.security.auth.Subject.doAs(Subject.java:422)
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org