[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277637#comment-17277637
 ] 

xuzq edited comment on HDFS-13609 at 2/3/21, 2:18 AM:
------------------------------------------------------

Thanks [~xkrogen] for the comment. It is when {{onlyDurableTxns}} is true that 
we take {{responseCounts.get(0)}}.

In our production environment, one NameNode went down when we failed it over to 
active, and we caught an exception like:

 
{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode IPC Server handler 227 on 8022: Error encountered requiring NN shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when there is a stream available for read: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@57d3ac44
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
        at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
        at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1658)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:111)
        at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:5409)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:620)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1125)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3246)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3242)
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:689)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3240)
{code}
After looking at the code, I think _editLogTailer.catchupDuringFailover()_ 
failed to catch up all the edits, so the check in 
_getFSImage().editLog.openForWrite()_ failed.

Because {{onlyDurableTxns}} is true, we take {{responseCounts.get(0)}}. One 
journal successfully wrote the edit into its cache but then failed to write it 
to disk, so that broken journal's response is the one returned by 
{{responseCounts.get(0)}}, and _editLogTailer.catchupDuringFailover()_ cannot 
catch up all the edits.

 
{quote}Thus since we only got 3 responses, we have to take the lowest txn that 
any of those responses are aware of.
{quote}
 * It may cause *_editLogTailer.catchupDuringFailover()_ failing to catch up 
all edits* when _maxAllowedTxns = {{responseCounts.get(0)}} = 0_.
 * It may also cause _doTailEdits_ to fail to tail any edits.
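To make the failure mode concrete, here is a minimal sketch of what taking 
{{responseCounts.get(0)}} from the sorted responses means (the class and method 
names are hypothetical, not the real QuorumJournalManager code): a single 
journal that responded to the RPC but persisted nothing drags the durable 
count, and hence {{maxAllowedTxns}}, down to 0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DurableTxnSketch {
    // Hypothetical model of the onlyDurableTxns=true path: a txn only
    // counts as durable if EVERY responding journal has it, so we take
    // the minimum (index 0 after sorting) of the reported edit counts.
    static int durableTxnCount(List<Integer> responseCounts) {
        List<Integer> sorted = new ArrayList<>(responseCounts);
        Collections.sort(sorted);
        return sorted.get(0);
    }

    public static void main(String[] args) {
        // Three journals respond. Two persisted 100 edits; the third wrote
        // the edit into its cache but failed the disk write and reports 0.
        System.out.println(durableTxnCount(List.of(100, 100, 0)));   // 0
        // With all journals healthy, the tailer can catch up fully.
        System.out.println(durableTxnCount(List.of(100, 100, 100))); // 100
    }
}
```

With the durable count pinned at 0 by the lagging responder, both 
_editLogTailer.catchupDuringFailover()_ and _doTailEdits_ would see nothing to 
tail, even though a quorum of healthy journals holds every edit.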

 

 



> [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via 
> RPC
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-13609
>                 URL: https://issues.apache.org/jira/browse/HDFS-13609
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, namenode
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>             Fix For: HDFS-12943, 3.3.0
>
>         Attachments: HDFS-13609-HDFS-12943.000.patch, 
> HDFS-13609-HDFS-12943.001.patch, HDFS-13609-HDFS-12943.002.patch, 
> HDFS-13609-HDFS-12943.003.patch, HDFS-13609-HDFS-12943.004.patch
>
>
> See HDFS-13150 for the full design.
> This JIRA is targetted at the NameNode-side changes to enable tailing 
> in-progress edits via the RPC mechanism added in HDFS-13608. Most changes are 
> in the QuorumJournalManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
