[jira] [Commented] (HDDS-4237) Testing Infrastructure for network partitioning

2020-09-15 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196672#comment-17196672
 ] 

Rui Wang commented on HDDS-4237:


https://github.com/apache/hadoop-ozone/blob/676610ef40b0f8701d950b347fb916d28268f1da/hadoop-ozone/fault-injection-test/mini-chaos-tests/src/test/java/org/apache/hadoop/ozone/failure/Failures.java#L55


https://github.com/apache/hadoop-ozone/blob/676610ef40b0f8701d950b347fb916d28268f1da/hadoop-ozone/fault-injection-test/mini-chaos-tests/src/test/java/org/apache/hadoop/ozone/TestMiniChaosOzoneCluster.java#L113

> Testing Infrastructure for network partitioning
> ---
>
> Key: HDDS-4237
> URL: https://issues.apache.org/jira/browse/HDDS-4237
> Project: Hadoop Distributed Data Store
> Issue Type: Sub-task
> Reporter: Rui Wang
> Priority: Major
>
> Network partitioning can cause a split-brain case in which two leaders
> exist at the same time. We need some sort of testing infrastructure/framework
> to simulate such a case and verify whether our SCM HA implementation can
> achieve strong consistency under a partitioned network.
> There are two possible approaches, suggested by Mukul Kumar Singh:
> a) Blockade tests - Blockade is a Docker-based framework in which the
> network for one DN can be isolated from the others.
> b) MiniOzoneChaosCluster - a unit-test-based approach in which a
> random datanode is killed; this helped in finding consistency
> issues.
> We might need a similar solution for SCM: block the SCM leader's network and
> also increase the timeout so that the old leader does not turn into a candidate.
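
The scenario in the issue description can be modeled with a small toy simulation. This is only a sketch under stated assumptions: `PartitionSketch`, `Network`, `Node`, `elect`, and `write` are hypothetical names invented for illustration and are not Ozone, SCM, or Ratis APIs. The long timeout is modeled by the isolated leader simply keeping its leader flag. The check a real test would want is that, although two nodes believe they are leader, only the one that can reach a majority is able to commit a write.

```java
import java.util.*;

/** Toy model of a partitioned 3-node quorum; NOT Ozone or Ratis code. */
public class PartitionSketch {
  /** Hypothetical in-memory network that tracks isolated node ids. */
  static final class Network {
    private final Set<String> isolated = new HashSet<>();
    void isolate(String node) { isolated.add(node); }
    void heal() { isolated.clear(); }
    /** Two distinct nodes can talk only if neither one is isolated. */
    boolean canReach(String a, String b) {
      return a.equals(b) || (!isolated.contains(a) && !isolated.contains(b));
    }
  }

  /** Hypothetical node state: current term and whether it believes it leads. */
  static final class Node {
    final String id;
    long term;
    boolean leader;
    Node(String id) { this.id = id; }
  }

  final Network net = new Network();
  final List<Node> nodes = new ArrayList<>();

  PartitionSketch(String... ids) {
    for (String id : ids) nodes.add(new Node(id));
  }

  Node node(String id) {
    for (Node n : nodes) if (n.id.equals(id)) return n;
    throw new IllegalArgumentException(id);
  }

  /** Number of nodes, including itself, that the given node can reach. */
  int reachable(Node from) {
    int count = 0;
    for (Node n : nodes) if (net.canReach(from.id, n.id)) count++;
    return count;
  }

  boolean hasQuorum(Node n) { return reachable(n) > nodes.size() / 2; }

  /** A candidate wins only with a majority; reachable peers adopt the new term. */
  boolean elect(String id) {
    Node c = node(id);
    if (!hasQuorum(c)) return false;
    long newTerm = c.term + 1;
    for (Node n : nodes) {
      if (net.canReach(c.id, n.id)) { n.term = newTerm; n.leader = false; }
    }
    c.leader = true;
    return true;
  }

  /** A write commits only if the node believes it leads AND reaches a majority. */
  boolean write(String id) {
    Node n = node(id);
    return n.leader && hasQuorum(n);
  }

  long leaderCount() { return nodes.stream().filter(n -> n.leader).count(); }

  public static void main(String[] args) {
    PartitionSketch c = new PartitionSketch("scm1", "scm2", "scm3");
    c.elect("scm1");
    c.net.isolate("scm1");   // block the leader's network
    c.elect("scm2");         // the majority side elects a new leader
    System.out.println("leaders=" + c.leaderCount());
    System.out.println("old leader write=" + c.write("scm1"));
    System.out.println("new leader write=" + c.write("scm2"));
  }
}
```

In this toy model the isolated `scm1` never hears about the higher term, so it still reports itself as leader, which is the split-brain condition the issue wants to reproduce. A test built on MiniOzoneChaosCluster or Blockade would need an equivalent way to cut the leader's network and assert that the stale leader cannot commit.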



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-4237) Testing Infrastructure for network partitioning

2020-09-15 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196670#comment-17196670
 ] 

Rui Wang commented on HDDS-4237:


We might start from MiniOzoneChaosCluster.




[jira] [Commented] (HDDS-4237) Testing Infrastructure for network partitioning

2020-09-14 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195579#comment-17195579
 ] 

Rui Wang commented on HDDS-4237:


Other potential ideas:

https://github.com/apache/hadoop-ozone/tree/master/hadoop-ozone/fault-injection-test/network-tests/src/test

https://chaos-mesh.org/

https://jepsen.io/

