[ 
https://issues.apache.org/jira/browse/HDFS-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783910#comment-17783910
 ] 

ASF GitHub Bot commented on HDFS-17241:
---------------------------------------

shuaiqig commented on PR #6236:
URL: https://github.com/apache/hadoop/pull/6236#issuecomment-1801256224

   > > it uploads an fsimage, which may be tens of gigabytes, and that makes 
the disk where the ANN metadata is stored very busy
   > 
   > This makes sense to me. I don't know whether the case @shuaiqig reported 
is the same as the one mentioned here. If it is, I don't think it can be 
prevented just by avoiding doing a checkpoint and rolling the edit log at the 
same time from the Standby, because sometimes the Active can trigger an edit 
log roll by itself. On the other hand, if the edit log is not rolled in time, 
replaying it puts more load on the Standby, because many more transactions 
pile up to be processed. Right? IMO, we should improve the local storage 
performance or limit the throughput of the fsimage upload, for example by 
decreasing `dfs.image.transfer.chunksize`. FYI.
   
   @Hexiaoqiao Yes, you are right. As I mentioned in Jira, this PR is not the 
best way to solve this problem; it is just the fastest and most convenient one 
I could come up with for now. And thank you for your review.
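For reference, here is a minimal hdfs-site.xml sketch of the throttling knobs 
mentioned in the quoted comment above. The values are illustrative only, not 
recommendations, and `dfs.image.transfer.bandwidthPerSec` is an additional 
knob beyond the one named in the comment; verify both keys against the 
hdfs-default.xml of your Hadoop version.

{code:xml}
<!-- Reduce the per-chunk size used when streaming the fsimage between NameNodes. -->
<property>
  <name>dfs.image.transfer.chunksize</name>
  <value>32768</value> <!-- bytes; illustrative value -->
</property>

<!-- Cap the bandwidth of the fsimage transfer so it cannot saturate the ANN
     metadata disks. 0 means unlimited. -->
<property>
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>10485760</value> <!-- ~10 MB/s; illustrative value -->
</property>
{code}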
   




> long write lock on active NN from rollEditLog()
> -----------------------------------------------
>
>                 Key: HDFS-17241
>                 URL: https://issues.apache.org/jira/browse/HDFS-17241
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.1.2
>            Reporter: shuaiqi.guo
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HDFS-17241.patch
>
>
> When the standby NN triggers a log roll on the active NN while it is also 
> sending an fsimage to the active NN, the active NN holds a long write lock, 
> which blocks almost all requests. For example:
> {code:java}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write 
> lock held for 27179 ms via java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:273)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:235)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1617)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4663)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1292)
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:146)
> org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12974)
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> java.security.AccessController.doPrivileged(Native Method)
> javax.security.auth.Subject.doAs(Subject.java:422)
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
>  {code}
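
A self-contained sketch of the locking pattern behind this trace (not the 
actual Hadoop source; the class and helper names below are made up for 
illustration): the edit-log roll runs entirely under the namesystem write 
lock, so any slow disk I/O during the roll, for example while a large fsimage 
upload is saturating the same disks, directly extends the time every other 
RPC is blocked.

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration of the pattern only: a single write lock guards the namesystem,
// and the edit-log roll (which touches the NN metadata disks) is performed while
// holding it, so slow disks translate directly into long lock hold times.
public class RollEditLogLockSketch {
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

  public void rollEditLog() throws InterruptedException {
    fsLock.writeLock().lock();             // every other read/write RPC now waits
    long start = System.nanoTime();
    try {
      finalizeCurrentSegmentAndStartNew(); // stand-in for the real segment fsync/rename work
    } finally {
      long heldMs = (System.nanoTime() - start) / 1_000_000;
      fsLock.writeLock().unlock();
      System.out.println("write lock held for " + heldMs + " ms"); // cf. the 27179 ms log line above
    }
  }

  private void finalizeCurrentSegmentAndStartNew() throws InterruptedException {
    // On a disk saturated by a concurrent fsimage upload this step can take tens of seconds.
    Thread.sleep(100);
  }

  public static void main(String[] args) throws InterruptedException {
    new RollEditLogLockSketch().rollEditLog();
  }
}
{code}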


