[ 
https://issues.apache.org/jira/browse/HDDS-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Huang updated HDDS-5263:
-------------------------------
    Summary: SCM may stay in safe mode forever due to incorrect open pipeline 
count  (was: SCM may stay in safe mode forever)

> SCM may stay in safe mode forever due to incorrect open pipeline count
> ----------------------------------------------------------------------
>
>                 Key: HDDS-5263
>                 URL: https://issues.apache.org/jira/browse/HDDS-5263
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM HA
>            Reporter: George Huang
>            Assignee: Bharat Viswanadham
>            Priority: Major
>
> SCM went into safe mode and never came out of it after an SCM restart.
> |INFO|SCMSafeModeManager|SCM in safe mode. Pipelines with at least one 
> datanode reported count is 1, required at least one datanode reported per 
> pipeline count is 6|
> However, at this time, Recon shows there are 6 open Ratis(3) pipelines and 10
> open Ratis(1) pipelines.
>  
> When SCM started, it had 6 pipelines in the open state; this count is read
> from the DB:
> {code:java}
> 2021-05-20 18:00:54,613 INFO
> org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total
> pipeline count is 6, pipeline's with at least one datanode reported
> threshold count is 6
> {code}
> But once the SCM Ratis server starts, it replays logs from the
> TransactionInfo last applied index, and after that I see all pipelines are
> removed (this might be due to pipeline close).
> Because this safe mode rule is never successfully validated, SCM never comes
> out of safe mode.
> https://issues.apache.org/jira/browse/HDDS-4399 took care to consider only
> open pipelines. This works for non-HA, where updates are written to the DB
> immediately. But in HA we write to DBTransactionBuffer, so pipelines may have
> been closed without that change being applied to the DB. When SCM is then
> restarted, PipelineManager is initialized first; it reads from the DB and
> gets a pipeline count of 6. SCM then replays its transactions, which removes
> those pipelines if the pipeline close happened before the restart. Because
> of this, the SCM safe mode rule cannot be successfully validated.
>  
> 2021-05-20 18:00:55,963 INFO
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[
> Id: c79a2082-9cac-4bcf-b303-9beaf84e5998, Nodes:
> d8f40fc5-ea38-4fd2-a588-4aaf9ac544d6
> {ip: xxxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
> RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
> /default, certSerialId: null, persistedOpState: IN_SERVICE,
> persistedOpStateExpiryEpochSec: 0}
> ea53e24e-3d10-4d41-93c9-a568a1627cca
> {ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
> RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
> /default, certSerialId: null, persistedOpState: IN_SERVICE,
> persistedOpStateExpiryEpochSec: 0}
> 9416da18-1fc4-4cb3-8200-6a71698c808e
> {ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
> RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
> /default, certSerialId: null, persistedOpState: IN_SERVICE,
> persistedOpStateExpiryEpochSec: 0}
> , ReplicationConfig: RATIS/THREE, State:CLOSED,
> leaderId:9416da18-1fc4-4cb3-8200-6a71698c808e,
> CreationTimestamp2021-05-20T18:00:54.497Z] removed.
> 2021-05-20 18:00:55,970 INFO
> org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[
> Id: e1b21d65-e80f-4ade-8e78-9bd956183a7c, Nodes:
> 8fd99eff-7f50-4b56-ad03-1e796030268d
> :
> :
> {ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
> RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
> /default, certSerialId: null, persistedOpState: IN_SERVICE,
> persistedOpStateExpiryEpochSec: 0}
> , ReplicationConfig: RATIS/THREE, State:CLOSED,
> leaderId:ea53e24e-3d10-4d41-93c9-a568a1627cca,
> CreationTimestamp2021-05-20T18:00:54.497Z] removed.
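
The restart sequence described in the issue can be sketched as a small, self-contained simulation. This is only an illustration under stated assumptions, not Ozone code: the class `SafeModeReplaySketch`, the nested `OneReplicaRule`, the method `restart`, and the pipeline IDs `p1`..`p6` are all hypothetical, and the threshold is simplified to equal the pipeline count seen at initialization (6 and 6 in the log above).

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class SafeModeReplaySketch {

    // Hypothetical stand-in for a safe mode rule that fixes its threshold
    // once, from the open pipelines visible when it is constructed.
    static final class OneReplicaRule {
        final int threshold;
        final Set<String> reported = new HashSet<>();

        OneReplicaRule(int openPipelinesAtInit) {
            // Assumption: threshold == pipeline count read at init.
            this.threshold = openPipelinesAtInit;
        }

        void report(String pipelineId) {
            reported.add(pipelineId);
        }

        boolean validate() {
            return reported.size() >= threshold;
        }
    }

    // Simulates one restart; returns true if the rule can be satisfied.
    static boolean restart(Set<String> pipelinesInDb, Set<String> closedInLog) {
        // 1. Initialization reads pipelines from the DB and sizes the rule.
        Set<String> live = new LinkedHashSet<>(pipelinesInDb);
        OneReplicaRule rule = new OneReplicaRule(live.size());

        // 2. Log replay removes pipelines whose close was never flushed
        //    to the DB before the restart.
        live.removeAll(closedInLog);

        // 3. Datanodes can only report for pipelines that still exist.
        for (String id : live) {
            rule.report(id);
        }
        return rule.validate();
    }

    public static void main(String[] args) {
        Set<String> db = Set.of("p1", "p2", "p3", "p4", "p5", "p6");
        // Nothing pending in the log: the rule can be validated.
        System.out.println(restart(db, Set.of()));
        // All six closes pending in the log: the threshold is unreachable.
        System.out.println(restart(db, db));
    }
}
```

In this sketch the threshold is captured before replay, so any pipeline removed during replay permanently lowers the achievable reported count below the threshold, which matches the "reported count is 1, required ... is 6" symptom in the safe mode log.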
