[ 
https://issues.apache.org/jira/browse/HDDS-5249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347687#comment-17347687
 ] 

Stephen O'Donnell commented on HDDS-5249:
-----------------------------------------

This issue is quite tricky to fully fix. There are two issues:

1. Parallel processing of an ICR and an FCR can lead to data inconsistency 
between the ContainerManager and NodeManager. This is what caused the bug 
described in this issue.

2. An FCR wiping out the reference to a container that was recently added via 
an ICR but is not yet included in the FCR.

The second issue is less serious, as the next FCR will fix the problem; FCRs 
are produced approximately every 60 seconds by default.
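For illustration, here is a minimal sketch of how the unsynchronised FCR path can drop a container added by a concurrent ICR. The class and method names are hypothetical, not the actual Ozone NodeManager code; the point is only that "replace the whole set" loses anything added between the FCR's snapshot and its completion:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: the FCR path replaces the node's container set
// wholesale, so a container added by an ICR while the FCR is in flight
// is silently lost.
class NodeContainerMap {
    private Set<Long> containers = new HashSet<>();

    void add(long id) {                  // ICR path: add one container
        containers.add(id);
    }

    void replaceAll(Set<Long> fromFcr) { // FCR path: overwrite with report
        containers = fromFcr;
    }

    boolean contains(long id) {
        return containers.contains(id);
    }

    static NodeContainerMap demoLostUpdate() {
        NodeContainerMap node = new NodeContainerMap();
        Set<Long> fcr = new HashSet<>();
        fcr.add(1L);                     // FCR was built before #1001 existed
        fcr.add(2L);
        node.add(1001L);                 // ICR lands while the FCR is in flight
        node.replaceAll(fcr);            // FCR completes and wipes out #1001
        return node;
    }
}
```

After `demoLostUpdate()` runs, the node no longer lists #1001 even though the ContainerManager still believes a replica exists there.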

We can fix problem 1 quite easily by synchronising on the datanode when 
processing FCRs and ICRs; that will ensure the data inconsistency cannot 
happen. However, even with the reports serialised, problem 2 can still occur:

1. ICR runs first and adds container #1001.

2. Then FCR runs and does not have #1001. It will see it in SCM for the node 
and remove it.

3. At this point, we have lost the reference to #1001 in SCM, but the next 
FCR will include it and put it back.

The other way around, there is no problem:

1. FCR runs first, #1001 is not present.

2. ICR runs, and adds #1001 and all is good.
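The per-datanode serialisation could be sketched roughly as follows. These class and method names are hypothetical, not the actual SCM handler code: both report paths lock on a canonical per-node object before touching the NodeManager state, so an FCR's snapshot-and-replace can no longer interleave with an ICR for the same node:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only (hypothetical names): serialise FCR and ICR processing per
// datanode by synchronising on a canonical per-node lock object.
class ReportSerializer {
    private final Map<String, Object> nodeLocks = new ConcurrentHashMap<>();

    private Object lockFor(String datanodeUuid) {
        // computeIfAbsent guarantees one lock object per datanode UUID.
        return nodeLocks.computeIfAbsent(datanodeUuid, k -> new Object());
    }

    // FCR path: replacing the node's container set happens under the lock.
    void processFullReport(String datanodeUuid, Set<Long> reportedContainers,
                           Map<String, Set<Long>> nodeManagerState) {
        synchronized (lockFor(datanodeUuid)) {
            nodeManagerState.put(datanodeUuid, reportedContainers);
        }
    }

    // ICR path: adding a single container takes the same per-node lock.
    void processIncrementalReport(String datanodeUuid, long containerId,
                                  Map<String, Set<Long>> nodeManagerState) {
        synchronized (lockFor(datanodeUuid)) {
            nodeManagerState
                .computeIfAbsent(datanodeUuid,
                        k -> ConcurrentHashMap.newKeySet())
                .add(containerId);
        }
    }
}
```

Note this only removes the interleaving (problem 1); as described above, an ICR followed by an FCR that predates it can still drop the reference until the next FCR arrives (problem 2).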

> Race Condition between Full and Incremental Container Reports
> -------------------------------------------------------------
>
>                 Key: HDDS-5249
>                 URL: https://issues.apache.org/jira/browse/HDDS-5249
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>    Affects Versions: 1.1.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> During testing we came across an issue with ICR and FCR handling.
> The following log shows the issue:
> {code}
> 2021-05-18 13:14:15,394 DEBUG 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing 
> replica of container #1 from datanode 
> 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: 
> quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, 
> RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
> networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, 
> persistedOpStateExpiryEpochSec: 0}
> 2021-05-18 13:14:15,394 DEBUG 
> org.apache.hadoop.hdds.scm.container.IncrementalContainerReportHandler: 
> Processing replica of container #1001 from datanode 
> 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: 
> quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, 
> RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
> networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, 
> persistedOpStateExpiryEpochSec: 0}
> 2021-05-18 13:14:15,394 DEBUG 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing 
> replica of container #2 from datanode 
> 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: 
> quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, 
> RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
> networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, 
> persistedOpStateExpiryEpochSec: 0}
> 2021-05-18 13:14:15,394 DEBUG 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing 
> replica of container #3 from datanode 
> 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: 
> quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, 
> RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
> networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, 
> persistedOpStateExpiryEpochSec: 0}
> 2021-05-18 13:14:15,394 DEBUG 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing 
> replica of container #4 from datanode 
> 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: 
> quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, 
> RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
> networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, 
> persistedOpStateExpiryEpochSec: 0}
> ...
> {code}
> In the above log, SCM is processing both an ICR and FCR for the same Datanode 
> at the same time. The FCR does not contain container #1001.
> The FCR starts first, and it takes a snapshot of the containers on the node 
> via NodeManager.
> Then it starts processing the containers one by one.
> The ICR then starts, and it added #1001 to the ContainerManager and to the 
> NodeManager.
> When the FCR completes, it replaces the list of containers in NodeManager 
> with those in the FCR.
> At this point, container #1001 is in the ContainerManager, but it is not 
> listed against the node in NodeManager.
> This would get fixed by the next FCR, but then the node goes dead. The dead 
> node handler runs and uses the list of containers in NodeManager to remove 
> all containers for the node. As #1001 is not listed, it is not removed by the 
> DeadNodeManager. This means the container will never be seen as under 
> replicated, as 3 copies will appear to exist forever in the ContainerManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
