[ https://issues.apache.org/jira/browse/HDDS-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
George Huang updated HDDS-5263:
-------------------------------
Summary: SCM may stay in safe mode forever due to incorrect open pipeline
count (was: SCM may stay in safe mode forever)
> SCM may stay in safe mode forever due to incorrect open pipeline count
> ----------------------------------------------------------------------
>
> Key: HDDS-5263
> URL: https://issues.apache.org/jira/browse/HDDS-5263
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM HA
> Reporter: George Huang
> Assignee: Bharat Viswanadham
> Priority: Major
>
> SCM went into safe mode after a restart and never came out of it.
> |INFO|SCMSafeModeManager|SCM in safe mode. Pipelines with at least one
> datanode reported count is 1, required at least one datanode reported per
> pipeline count is 6|
> However, at this time, Recon shows 6 open Ratis(3) pipelines and 10 open
> Ratis(1) pipelines.
>
> When SCM started, it had 6 pipelines in the open state; this count is read
> from the DB:
> {code:java}
> 783833 2021-05-20 18:00:54,613 INFO
> org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total
> pipeline count is 6, pipeline's with at least one datanode reported
> threshold count is 6
> {code}
> But once the SCM Ratis server started, it replayed logs from the
> TransactionInfo last applied index, and after that I see that all pipelines
> were removed (possibly due to pipeline close).
> Because this safe mode rule is never successfully validated, SCM never came
> out of safe mode.
> https://issues.apache.org/jira/browse/HDDS-4399 took care to consider open
> pipelines. That works for non-HA, where updates are written to the DB
> immediately. In HA, however, updates are written to DBTransactionBuffer, so
> pipelines may have been closed without the change being applied to the DB.
> When SCM is then restarted, PipelineManager is initialized first; it reads
> from the DB and gets a pipeline count of 6. SCM then replays its
> transactions, which removes those pipelines if the pipeline close happened
> earlier. Because of this, the SCM safe mode rule can never be successfully
> validated.
>
> 783875 2021-05-20 18:00:55,963 INFO
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[
> Id: c79a2082-9cac-4bcf-b303-9beaf84e5998, Nodes:
> d8f40fc5-ea38-4fd2-a588-4aaf9ac544d6
> {ip: xxxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
> RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
> /default, certSerialId: null, persistedOpState: IN_SERVICE,
> persistedOpStateExpiryEpochSec: 0}
> ea53e24e-3d10-4d41-93c9-a568a1627cca
> {ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
> RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
> /default, certSerialId: null, persistedOpState: IN_SERVICE,
> persistedOpStateExpiryEpochSec: 0}
> 9416da18-1fc4-4cb3-8200-6a71698c808e
> {ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
> RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
> /default, certSerialId: null, persistedOpState: IN_SERVICE,
> persistedOpStateExpiryEpochSec: 0}
> , ReplicationConfig: RATIS/THREE, State:CLOSED,
> leaderId:9416da18-1fc4-4cb3-8200-6a71698c808e,
> CreationTimestamp2021-05-20T18:00:54.497Z] removed.
> 783882 2021-05-20 18:00:55,970 INFO
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[
> Id: e1b21d65-e80f-4ade-8e78-9bd956183a7c, Nodes:
> 8fd99eff-7f50-4b56-ad03-1e796030268d
> :
> :
> {ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
> RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
> /default, certSerialId: null, persistedOpState: IN_SERVICE,
> persistedOpStateExpiryEpochSec: 0}
> , ReplicationConfig: RATIS/THREE, State:CLOSED,
> leaderId:ea53e24e-3d10-4d41-93c9-a568a1627cca,
> CreationTimestamp2021-05-20T18:00:54.497Z] removed.
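
The sequence described above (rule initialized from the DB snapshot, then
Ratis log replay removing the pipelines) can be sketched as a toy model. This
is only an illustration under stated assumptions: OneReplicaRule,
simulateRestart, refresh and the pipeline names p1..p7 are hypothetical and are
not Ozone classes or APIs. The creation of a new pipeline p7 after replay is
also assumed for the example, and the refresh step shows one possible
mitigation, not necessarily the fix adopted for this issue.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the restart sequence; none of these names are Ozone APIs. */
final class StaleThresholdSketch {

    /** Hypothetical stand-in for a "one replica per pipeline" safe mode rule. */
    static final class OneReplicaRule {
        private final Set<String> tracked = new HashSet<>();
        private final Set<String> reported = new HashSet<>();
        private int threshold;

        OneReplicaRule(Set<String> openPipelines) {
            refresh(openPipelines);
        }

        /** Re-reads the open pipelines and recomputes the threshold. */
        void refresh(Set<String> openPipelines) {
            tracked.clear();
            tracked.addAll(openPipelines);
            reported.retainAll(tracked);
            threshold = tracked.size();
        }

        /** A datanode reported this pipeline; count it only if it is tracked. */
        void onPipelineReport(String pipelineId) {
            if (tracked.contains(pipelineId)) {
                reported.add(pipelineId);
            }
        }

        boolean validate() {
            return reported.size() >= threshold;
        }

        int threshold() {
            return threshold;
        }

        int reportedCount() {
            return reported.size();
        }
    }

    /** Runs the restart sequence and returns the rule after datanode reports. */
    static OneReplicaRule simulateRestart(boolean refreshAfterReplay) {
        // State found in the DB at startup: six pipelines still marked open.
        Set<String> pipelines =
            new HashSet<>(Set.of("p1", "p2", "p3", "p4", "p5", "p6"));

        // 1. The rule is initialized from the DB snapshot (threshold = 6).
        OneReplicaRule rule = new OneReplicaRule(pipelines);

        // 2. Log replay applies the buffered close/remove transactions.
        //    Assumption for the example: one new pipeline exists afterwards.
        pipelines.clear();
        pipelines.add("p7");

        // Possible mitigation: recompute the threshold once replay is done.
        if (refreshAfterReplay) {
            rule.refresh(pipelines);
        }

        // 3. Datanodes can only report pipelines that still exist.
        for (String id : pipelines) {
            rule.onPipelineReport(id);
        }
        return rule;
    }

    public static void main(String[] args) {
        OneReplicaRule stale = simulateRestart(false);
        System.out.println("stale: reported=" + stale.reportedCount()
            + " threshold=" + stale.threshold()
            + " validated=" + stale.validate());

        OneReplicaRule fresh = simulateRestart(true);
        System.out.println("refreshed: reported=" + fresh.reportedCount()
            + " threshold=" + fresh.threshold()
            + " validated=" + fresh.validate());
    }
}
```

Without the refresh, the model keeps a threshold of 6 that no datanode report
can ever satisfy, which mirrors the "required ... count is 6" message in the
log above. With the refresh, the threshold follows the post-replay state and
the rule can be validated.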
--
This message was sent by Atlassian Jira
(v8.3.4#803005)