[jira] [Issue Comment Deleted] (HDFS-9126) namenode crash in fsimage download/transfer

2021-06-14 Thread Seokchan Yoon (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seokchan Yoon updated HDFS-9126:

Comment: was deleted

(was: Why is this closed? I ran into the same situation and need to figure out 
the reason why the previous active NN failed on doHealthChecks.)

> namenode crash in fsimage download/transfer
> ---
>
> Key: HDFS-9126
> URL: https://issues.apache.org/jira/browse/HDFS-9126
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
> Environment: OS: CentOS 6.5 (Final)
> Apache Hadoop: 2.6.0
> NameNode HA with 5 JournalNodes
>Reporter: zengyongping
>Priority: Critical
>
> In our production Hadoop cluster, when the active NameNode downloads the
> fsimage from the standby NameNode, the ZKFC health check of the NameNode
> sometimes hits a socket timeout. ZKFC then judges the active NameNode to be in
> the SERVICE_NOT_RESPONDING state, an HA failover is triggered, and the old
> active NameNode is fenced.
> zkfc logs:
> 2015-09-24 11:44:44,739 WARN org.apache.hadoop.ha.HealthMonitor: 
> Transport-level exception trying to monitor health of NameNode at 
> hostname1/192.168.10.11:8020: Call From hostname1/192.168.10.11 to 
> hostname1:8020 failed on socket timeout exception: 
> java.net.SocketTimeoutException: 45000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/192.168.10.11:22614 remote=hostname1/192.168.10.11:8020]; For more 
> details see:  http://wiki.apache.org/hadoop/SocketTimeout
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.HealthMonitor: Entering 
> state SERVICE_NOT_RESPONDING
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Local 
> service NameNode at hostname1/192.168.10.11:8020 entered state: 
> SERVICE_NOT_RESPONDING
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: 
> Quitting master election for NameNode at hostname1/192.168.10.11:8020 and 
> marking that fencing is necessary
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Yielding from election
> 2015-09-24 11:44:44,761 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x54d81348fe503e3 closed
> 2015-09-24 11:44:44,761 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x54d81348fe503e3
> 2015-09-24 11:44:44,764 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> namenode logs:
> 2015-09-24 11:43:34,074 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 
> 192.168.10.12
> 2015-09-24 11:43:34,074 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
> 2015-09-24 11:43:34,075 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 
> 2317430129
> 2015-09-24 11:43:34,253 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 
> 272988 Total time for transactions(ms): 5502 Number of transactions batched 
> in Syncs: 146274 Number of syncs: 32375 SyncTimes(ms): 274465 319599
> 2015-09-24 11:43:46,005 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: 
> Rescanning after 3 milliseconds
> 2015-09-24 11:44:21,054 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReplicationMonitor timed out blk_1185804191_112164210
> 2015-09-24 11:44:36,076 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits 
> file 
> /software/data/hadoop-data/hdfs/namenode/current/edits_inprogress_02317430129
>  -> 
> /software/data/hadoop-data/hdfs/namenode/current/edits_02317430129-02317703116
> 2015-09-24 11:44:36,077 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 
> 2317703117
> 2015-09-24 11:45:38,008 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 
> Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 
> Number of syncs: 0 SyncTimes(ms): 0 61585
> 2015-09-24 11:45:38,009 INFO 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Transfer took 222.88s 
> at 63510.29 KB/s
> 2015-09-24 11:45:38,009 INFO 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file 
> fsimage.ckpt_02317430128 size 14495092105 bytes.
> 2015-09-24 11:45:38,416 WARN 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 
> 192.168.10.13:8485 failed to write txns 2317703117-2317703117. Will try to 
> write to this JN again after the next log roll.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 44 is 
> less than the last promised epoch 45
> at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)
> at 
> org.apache.hadoop.hd
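
The "45000 millis timeout" in the ZKFC log above matches the default of
ha.health-monitor.rpc-timeout.ms (45000 ms in core-default.xml), which bounds
how long the ZKFC HealthMonitor waits for the NameNode's monitorHealth() RPC.
Raising it in core-site.xml is one possible mitigation when large checkpoint
transfers are known to stall the NameNode, though that only papers over
whatever is blocking the RPC. A minimal sketch for checking the effective value
on a node, assuming the Hadoop configuration files are on the classpath:

import org.apache.hadoop.conf.Configuration;

public class HealthMonitorTimeoutCheck {
  public static void main(String[] args) {
    // Loads core-default.xml and core-site.xml from the classpath.
    Configuration conf = new Configuration();
    // Defaults to 45000 ms, which matches the "45000 millis timeout"
    // reported by the ZKFC HealthMonitor in the log above.
    int timeoutMs = conf.getInt("ha.health-monitor.rpc-timeout.ms", 45000);
    System.out.println("ZKFC health-monitor RPC timeout: " + timeoutMs + " ms");
  }
}

The later QuorumJournalManager warning ("IPC's epoch 44 is less than the last
promised epoch 45") is consistent with a completed failover: once the standby
won the election and began writing, it bumped the promised epoch on the
JournalNodes, so edit-log writes from the old active are rejected.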

[jira] [Issue Comment Deleted] (HDFS-9126) namenode crash in fsimage download/transfer

2017-07-11 Thread linhaiqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

linhaiqiang updated HDFS-9126:
--
Comment: was deleted

(was: Similar issue here. At 2017-07-06 17:22:12 the active NameNode's health
monitor hit the 45-second timeout, the active NameNode quit the election, and
the ZK connection was closed.
However, the NameNode log has no new entries after 2017-07-06 17:20:37 and no
exceptions, and the GC log looks normal.
From source-code analysis, the failure sequence is as follows (illustrated in
the sketch after this comment):
1. The HealthMonitor on the active node hits the 45-second timeout while
checking the NN state, so the service enters the SERVICE_NOT_RESPONDING state.
2. The active node quits the election, the ZK client closes its ZooKeeper
connection, and the ephemeral lock znode on ZK is automatically deleted.
3. The standby node acquires the ephemeral lock znode and starts transitioning
itself to active. To prevent a split-brain with two active NameNodes, it first
asks the old active to transition to standby via an RPC request.
4. If that RPC request is blocked, it connects over SSH and kills the old
NameNode process.
5. Once fencing succeeds, it transitions itself to active.
As for why the 45-second timeout happened: because the process was killed, no
jstack stack trace is available from that moment, so there is no way to find
out.)
